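First of all, I apologize if this has already been answered explicitly. I have seen very similar answers, but I could not get them to work for my case. My problem is to match two pairs of columns (UsedFName==FName and UsedLName==LName) and then, on an exact match, fill the Usedid column with the ID from the 'id' column.
Here is a toy dataset: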
>>> df
FName LName id UsedFName UsedLName Usedid
0 Tanvir Hossain 2001 Tanvir Hossain NaN
1 Nadia Alam 2002 Tanvir Hossain NaN
2 Pia Naime 2003 Tanvir Hossain NaN
3 Koethe Talukdar 2004 Koethe Talukdar NaN
4 Manual Hausman 2005 Koethe Talukdar NaN
5 Constantine Pape NaN Max Weber NaN
6 Andreas Kai 2006 Max Weber NaN
7 Max Weber 2007 Manual Hausman NaN
8 Weber Mac 2008 Manual Hausman NaN
9 Plank Ingo 2009 Manual Hausman NaN
10 Tanvir Hossain 2001 Pia Naime NaN
11 Weber Mac 2008 Pia Naime NaN
12 Manual Hausman 2005 Tanvir Hossain NaN
13 Max Weber 2007 Tanvir Hossain NaN
14 Nadia Alam 2002 Manual Hausman NaN
15 Weber Mac 2008 Manual Hausman NaN
16 Pia Naime 2003 Koethe Talukdar NaN
17 Pia Naime 2003 Koethe Talukdar NaN
18 Constantine Pape NaN Koethe Talukdar NaN
19 Koethe Talukdar 2004 Koethe Talukdar NaN
20 Koethe Talukdar 2005 Manual Hausman NaN
21 NaN NaN NaN Manual Hausman NaN
22 NaN NaN NaN Manual Hausman NaN
23 NaN NaN NaN Manual Hausman NaN
24 NaN NaN NaN Manual Hausman NaN
25 NaN NaN NaN Manual Hausman NaN
26 NaN NaN NaN Manual Hausman NaN
27 NaN NaN NaN Manual Hausman NaN
This is the expected output:
>>> df
FName LName id UsedFName UsedLName Usedid
0 Tanvir Hossain 2001 Tanvir Hossain 2001
1 Nadia Alam 2002 Tanvir Hossain 2001
2 Pia Naime 2003 Tanvir Hossain 2001
3 Koethe Talukdar 2004 Koethe Talukdar 2005
4 Manual Hausman 2005 Koethe Talukdar 2005
5 Constantine Pape NaN Max Weber 2007
6 Andreas Kai 2006 Max Weber 2007
7 Max Weber 2007 Manual Hausman 2005
8 Weber Mac 2008 Manual Hausman 2005
9 Plank Ingo 2009 Manual Hausman 2005
10 Tanvir Hossain 2001 Pia Naime 2003
11 Weber Mac 2008 Pia Naime 2003
12 Manual Hausman 2005 Tanvir Hossain 2001
13 Max Weber 2007 Tanvir Hossain 2001
14 Nadia Alam 2002 Manual Hausman 2005
15 Weber Mac 2008 Manual Hausman 2005
16 Pia Naime 2003 Koethe Talukdar 2005
17 Pia Naime 2003 Koethe Talukdar 2005
18 Constantine Pape NaN Koethe Talukdar 2005
19 Koethe Talukdar 2004 Koethe Talukdar 2005
20 Koethe Talukdar 2005 Manual Hausman 2005
21 NaN NaN NaN Manual Hausman 2005
22 NaN NaN NaN Manual Hausman 2005
23 NaN NaN NaN Manual Hausman 2005
24 NaN NaN NaN Manual Hausman 2005
25 NaN NaN NaN Manual Hausman 2005
26 NaN NaN NaN Manual Hausman 2005
27 NaN NaN NaN Manual Hausman 2005
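I can actually do this with a nested for loop; here is the code:
for i in df.index:
    for j in df.index:
        # Use `and` here: `&` binds tighter than `==` and breaks the comparison.
        if df.loc[i, 'UsedFName'] == df.loc[j, 'FName'] and df.loc[i, 'UsedLName'] == df.loc[j, 'LName']:
            # df.ix has been removed from pandas; df.loc is the label-based replacement.
            df.loc[i, 'Usedid'] = df.loc[j, 'id']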
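However, a nested for loop is computationally very expensive here, and I have a huge dataset. Is it possible to do this without the nested loop? Is there a simple Pythonic way, or a Pandas/NumPy method, that I could use?
Thank you very much for your help... I am looking forward to learning Python.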
Answer 0 (score: 2)
You may want to think about something more pandas-like, perhaps adding the hashing logic, but this matches your expected output and is much more efficient. You just need to look up the id whose FName and LName match each UsedFName and UsedLName:
import pandas as pd
# Create a dict where each key is a tuple (FName, LName)
# with the corresponding id as the value; later duplicates overwrite earlier ones
d = dict(zip(zip(df["FName"], df["LName"]), df["id"]))
# Look up each (UsedFName, UsedLName) tuple in d to get the id for that pairing
df["Usedid"] = [d[(f, l)] for f, l in zip(df["UsedFName"], df["UsedLName"])]
print(df["Usedid"])
If some names might not be present, you can use dict.get with a default value. This is also faster than the suggested groupby approach.
Output:
0 2001
1 2001
2 2001
3 2005
4 2005
5 2007
6 2007
7 2005
8 2005
9 2005
10 2003
11 2003
12 2001
13 2001
14 2005
15 2005
16 2005
17 2005
18 2005
19 2005
20 2005
21 2005
22 2005
23 2005
24 2005
25 2005
26 2005
27 2005
Name: Usedid, dtype: float64
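The dict.get variant mentioned in this answer could look roughly like the sketch below. The miniature frame and the name "Ada Lovelace" are made up for illustration, and NaN as the default value is an assumption; any other default would work the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: "Ada Lovelace" never appears in FName/LName.
df = pd.DataFrame({
    "FName": ["Tanvir", "Nadia", "Pia"],
    "LName": ["Hossain", "Alam", "Naime"],
    "id": [2001, 2002, 2003],
    "UsedFName": ["Pia", "Ada", "Tanvir"],
    "UsedLName": ["Naime", "Lovelace", "Hossain"],
})

# Map (FName, LName) -> id; later duplicates overwrite earlier ones.
d = dict(zip(zip(df["FName"], df["LName"]), df["id"]))

# dict.get returns the default (NaN here) instead of raising KeyError
# when a (UsedFName, UsedLName) pair has no matching (FName, LName).
df["Usedid"] = [d.get(key, np.nan) for key in zip(df["UsedFName"], df["UsedLName"])]
print(df["Usedid"])
```

With plain d[key], the unmatched pair would raise a KeyError; with d.get the row simply keeps a missing value.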
Answer 1 (score: 1)
This works:
ids = df.groupby(['FName', 'LName']).id.apply(lambda x: list(x)[-1])
df.Usedid = df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName]), axis=1)
First, we find the ID for each FName and LName pair:
ids = df.groupby(['FName', 'LName']).id.apply(lambda x: list(x)[-1])
They look like this:
FName LName
Andreas Kai 2006
Constantine Pape NaN
Koethe Talukdar 2005
Manual Hausman 2005
Max Weber 2007
Nadia Alam 2002
Pia Naime 2003
Plank Ingo 2009
Tanvir Hossain 2001
Weber Mac 2008
Name: id, dtype: float64
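Here groupby() groups by two columns, the first and last names. To "see" anything, you need to "do" something with the groups. Let's convert all the IDs of each group into a list: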
>>> df.groupby(['FName', 'LName']).id.apply(list)
FName LName
Andreas Kai [2006.0]
Constantine Pape [nan, nan]
Koethe Talukdar [2004.0, 2004.0, 2005.0]
Manual Hausman [2005.0, 2005.0]
Max Weber [2007.0, 2007.0]
Nadia Alam [2002.0, 2002.0]
Pia Naime [2003.0, 2003.0, 2003.0]
Plank Ingo [2009.0]
Tanvir Hossain [2001.0, 2001.0]
Weber Mac [2008.0, 2008.0, 2008.0]
Name: id, dtype: object
Since there are NaN values, the data type is float.
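A minimal sketch of why the NaN forces a float dtype, using two made-up Series. The last line uses the nullable "Int64" extension dtype, which is not part of the original answer and is shown only as an option if integer IDs are wanted despite missing values.

```python
import numpy as np
import pandas as pd

ids_without_nan = pd.Series([2001, 2002])
ids_with_nan = pd.Series([2001, np.nan])

# Without missing values pandas picks an integer dtype.
print(ids_without_nan.dtype)

# NaN is a float, so the whole Series is upcast to float64.
print(ids_with_nan.dtype)

# Assumption beyond the original answer: the nullable integer dtype
# keeps integers and stores the gap as <NA>.
nullable_ids = pd.Series([2001, None], dtype="Int64")
print(nullable_ids.dtype)
```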
We only want the last ID of each group, so instead of the list() function we use a lambda:
lambda x: list(x)[-1]
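To make the effect of this lambda concrete, here is a small sketch on a made-up three-row frame. The x.iloc[-1] variant is an assumption of mine, not part of the answer; it picks the same positional last element without building a list.

```python
import pandas as pd

# Hypothetical frame: "Koethe Talukdar" appears twice with different ids.
df = pd.DataFrame({
    "FName": ["Koethe", "Koethe", "Max"],
    "LName": ["Talukdar", "Talukdar", "Weber"],
    "id": [2004, 2005, 2007],
})

grouped = df.groupby(["FName", "LName"])["id"]

# The lambda from the answer: last element of each group's id list.
last_via_list = grouped.apply(lambda x: list(x)[-1])

# Equivalent positional access on the group Series itself.
last_via_iloc = grouped.apply(lambda x: x.iloc[-1])

print(last_via_list)
```

The duplicated name keeps the id from its last row, which is why Koethe Talukdar maps to 2005 rather than 2004 in the expected output.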
In the second step, we use ids:
df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName]), axis=1)
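We apply the function to the dataframe row by row (axis=1). Here x is a row. We use the values in the UsedFName and UsedLName columns to look up the corresponding ID, and assign the result to the target column with df.Usedid =.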
df looks like this:
FName LName id UsedFName UsedLName Usedid
0 Tanvir Hossain 2001 Tanvir Hossain 2001
1 Nadia Alam 2002 Tanvir Hossain 2001
2 Pia Naime 2003 Tanvir Hossain 2001
3 Koethe Talukdar 2004 Koethe Talukdar 2005
4 Manual Hausman 2005 Koethe Talukdar 2005
5 Constantine Pape NaN Max Weber 2007
6 Andreas Kai 2006 Max Weber 2007
7 Max Weber 2007 Manual Hausman 2005
8 Weber Mac 2008 Manual Hausman 2005
9 Plank Ingo 2009 Manual Hausman 2005
10 Tanvir Hossain 2001 Pia Naime 2003
11 Weber Mac 2008 Pia Naime 2003
12 Manual Hausman 2005 Tanvir Hossain 2001
13 Max Weber 2007 Tanvir Hossain 2001
14 Nadia Alam 2002 Manual Hausman 2005
15 Weber Mac 2008 Manual Hausman 2005
16 Pia Naime 2003 Koethe Talukdar 2005
17 Pia Naime 2003 Koethe Talukdar 2005
18 Constantine Pape NaN Koethe Talukdar 2005
19 Koethe Talukdar 2004 Koethe Talukdar 2005
20 Koethe Talukdar 2005 Manual Hausman 2005
21 NaN NaN NaN Manual Hausman 2005
22 NaN NaN NaN Manual Hausman 2005
23 NaN NaN NaN Manual Hausman 2005
24 NaN NaN NaN Manual Hausman 2005
25 NaN NaN NaN Manual Hausman 2005
26 NaN NaN NaN Manual Hausman 2005
27 NaN NaN NaN Manual Hausman 2005
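Neither answer uses it, but the same lookup can also be written as a left merge, which avoids both the Python-level loop and the row-wise apply. The sketch below is an alternative technique, not the method of either answer; the miniature frame and the lookup and result names are hypothetical, and keep="last" is chosen to mirror the "last ID wins" behaviour of the answers above.

```python
import pandas as pd

# Hypothetical miniature frame without a pre-existing Usedid column.
df = pd.DataFrame({
    "FName": ["Tanvir", "Koethe", "Koethe", "Max"],
    "LName": ["Hossain", "Talukdar", "Talukdar", "Weber"],
    "id": [2001, 2004, 2005, 2007],
    "UsedFName": ["Koethe", "Tanvir", "Max", "Koethe"],
    "UsedLName": ["Talukdar", "Hossain", "Weber", "Talukdar"],
})

# One row per (FName, LName) pair, keeping the last id seen for each pair,
# renamed so the key columns line up with the Used* columns.
lookup = (
    df[["FName", "LName", "id"]]
    .dropna(subset=["FName", "LName"])
    .drop_duplicates(subset=["FName", "LName"], keep="last")
    .rename(columns={"FName": "UsedFName", "LName": "UsedLName", "id": "Usedid"})
)

# A left merge keeps every row of df in its original order;
# pairs without a match end up with NaN in Usedid.
result = df.merge(lookup, on=["UsedFName", "UsedLName"], how="left")
print(result[["UsedFName", "UsedLName", "Usedid"]])
```

If the frame already has a Usedid column, as in the question, it would need to be dropped first (for example with df.drop(columns="Usedid")), otherwise the merge produces suffixed duplicate columns.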