Match between two pairs of columns and fetch the value from another column in pandas

Date: 2016-02-07 10:24:06

Tags: python pandas

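First of all, I apologize if this question has already been answered explicitly. I have seen very similar answers, but I could not apply them to my case. My problem is to match two pairs of columns (UsedFName == FName and UsedLName == LName) and, when both match exactly, fill the Usedid column with the ID from the 'id' column.

Here is a toy dataset: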

>>> df
          FName     LName    id UsedFName UsedLName  Usedid
0        Tanvir   Hossain  2001    Tanvir   Hossain     NaN
1         Nadia      Alam  2002    Tanvir   Hossain     NaN
2           Pia     Naime  2003    Tanvir   Hossain     NaN
3        Koethe  Talukdar  2004    Koethe  Talukdar     NaN
4        Manual   Hausman  2005    Koethe  Talukdar     NaN
5   Constantine      Pape   NaN       Max     Weber     NaN
6       Andreas       Kai  2006       Max     Weber     NaN
7           Max     Weber  2007    Manual   Hausman     NaN
8         Weber       Mac  2008    Manual   Hausman     NaN
9         Plank      Ingo  2009    Manual   Hausman     NaN
10       Tanvir   Hossain  2001       Pia     Naime     NaN
11        Weber       Mac  2008       Pia     Naime     NaN
12       Manual   Hausman  2005    Tanvir   Hossain     NaN
13          Max     Weber  2007    Tanvir   Hossain     NaN
14        Nadia      Alam  2002    Manual   Hausman     NaN
15        Weber       Mac  2008    Manual   Hausman     NaN
16          Pia     Naime  2003    Koethe  Talukdar     NaN
17          Pia     Naime  2003    Koethe  Talukdar     NaN
18  Constantine      Pape   NaN    Koethe  Talukdar     NaN
19       Koethe  Talukdar  2004    Koethe  Talukdar     NaN
20       Koethe  Talukdar  2005    Manual   Hausman     NaN
21          NaN       NaN   NaN    Manual   Hausman     NaN
22          NaN       NaN   NaN    Manual   Hausman     NaN
23          NaN       NaN   NaN    Manual   Hausman     NaN
24          NaN       NaN   NaN    Manual   Hausman     NaN
25          NaN       NaN   NaN    Manual   Hausman     NaN
26          NaN       NaN   NaN    Manual   Hausman     NaN
27          NaN       NaN   NaN    Manual   Hausman     NaN

This is the expected output:

>>> df
          FName     LName    id UsedFName UsedLName  Usedid
0        Tanvir   Hossain  2001    Tanvir   Hossain    2001
1         Nadia      Alam  2002    Tanvir   Hossain    2001
2           Pia     Naime  2003    Tanvir   Hossain    2001
3        Koethe  Talukdar  2004    Koethe  Talukdar    2005
4        Manual   Hausman  2005    Koethe  Talukdar    2005
5   Constantine      Pape   NaN       Max     Weber    2007
6       Andreas       Kai  2006       Max     Weber    2007
7           Max     Weber  2007    Manual   Hausman    2005
8         Weber       Mac  2008    Manual   Hausman    2005
9         Plank      Ingo  2009    Manual   Hausman    2005
10       Tanvir   Hossain  2001       Pia     Naime    2003
11        Weber       Mac  2008       Pia     Naime    2003
12       Manual   Hausman  2005    Tanvir   Hossain    2001
13          Max     Weber  2007    Tanvir   Hossain    2001
14        Nadia      Alam  2002    Manual   Hausman    2005
15        Weber       Mac  2008    Manual   Hausman    2005
16          Pia     Naime  2003    Koethe  Talukdar    2005
17          Pia     Naime  2003    Koethe  Talukdar    2005
18  Constantine      Pape   NaN    Koethe  Talukdar    2005
19       Koethe  Talukdar  2004    Koethe  Talukdar    2005
20       Koethe  Talukdar  2005    Manual   Hausman    2005
21          NaN       NaN   NaN    Manual   Hausman    2005
22          NaN       NaN   NaN    Manual   Hausman    2005
23          NaN       NaN   NaN    Manual   Hausman    2005
24          NaN       NaN   NaN    Manual   Hausman    2005
25          NaN       NaN   NaN    Manual   Hausman    2005
26          NaN       NaN   NaN    Manual   Hausman    2005
27          NaN       NaN   NaN    Manual   Hausman    2005

I can actually do this with nested for loops. Here is the code:

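# Compare every row with every other row (O(n^2) comparisons).
# Use `and` for scalar comparisons (`&` binds tighter than `==`),
# and .loc instead of .ix, which was removed in pandas 1.0.
for i in df.index:
    for j in df.index:
        if df.loc[i, 'UsedFName'] == df.loc[j, 'FName'] and df.loc[i, 'UsedLName'] == df.loc[j, 'LName']:
            df.loc[i, 'Usedid'] = df.loc[j, 'id']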

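But nested for loops are computationally very expensive here, and I have a huge dataset. Is it possible to do this without nested loops? Is there a simple Pythonic way, or a Pandas / NumPy method, that I could use?

Thank you very much for your help. I am looking forward to learning Python.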

2 answers:

Answer 0 (score: 2)

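I will leave it to someone more pandas-savvy to add hashing logic, but this matches your expected output and is much more efficient than the nested loops. You just need the id of the (FName, LName) pair that matches each (UsedFName, UsedLName) pair: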

import pandas as pd

# Create a dict where each key is a tuple -> (FName, LName)
# with the corresponding id as the value; for repeated names the last id wins
d = dict(zip(zip(df["FName"], df["LName"]), df["id"]))

# Do a lookup in d using a tuple -> (UsedFName, UsedLName) to get the correct id for each pairing
df["Usedid"] = [d[(f, l)] for f,l in zip(df["UsedFName"], df["UsedLName"])]
print(df["Usedid"])

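If some used names might not be among the keys, you can use dict.get with a default value.

This is faster than the suggested groupby approach.

Output of the code above: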

0     2001
1     2001
2     2001
3     2005
4     2005
5     2007
6     2007
7     2005
8     2005
9     2005
10    2003
11    2003
12    2001
13    2001
14    2005
15    2005
16    2005
17    2005
18    2005
19    2005
20    2005
21    2005
22    2005
23    2005
24    2005
25    2005
26    2005
27    2005
Name: Usedid, dtype: float64
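The dict.get fallback mentioned in this answer could look roughly like the sketch below. The three-row frame is a hypothetical miniature of the question's data, and NaN as the default value is an assumption; any placeholder would do.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's frame:
# "Max Weber" is used but never appears in FName/LName.
df = pd.DataFrame({
    "FName": ["Tanvir", "Nadia", "Pia"],
    "LName": ["Hossain", "Alam", "Naime"],
    "id": [2001, 2002, 2003],
    "UsedFName": ["Tanvir", "Max", "Pia"],
    "UsedLName": ["Hossain", "Weber", "Naime"],
})

# Same (FName, LName) -> id dict as in the answer.
d = dict(zip(zip(df["FName"], df["LName"]), df["id"]))

# dict.get returns the default (NaN, an assumed choice) instead of raising KeyError.
df["Usedid"] = [d.get((f, l), np.nan)
                for f, l in zip(df["UsedFName"], df["UsedLName"])]
print(df["Usedid"])
```

With a plain d[(f, l)] lookup the second row would raise a KeyError; with dict.get it simply stays missing.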

Answer 1 (score: 1)

Solution

This works:

ids = df.groupby(['FName', 'LName']).id.apply(lambda x: list(x)[-1])
df.Usedid = df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName]), axis=1)

Explanation

First, we find the ID for each combination of FName and LName:

ids = df.groupby(['FName', 'LName']).id.apply(lambda x: list(x)[-1])

They look like this:

FName        LName   
Andreas      Kai         2006
Constantine  Pape         NaN
Koethe       Talukdar    2005
Manual       Hausman     2005
Max          Weber       2007
Nadia        Alam        2002
Pia          Naime       2003
Plank        Ingo        2009
Tanvir       Hossain     2001
Weber        Mac         2008
Name: id, dtype: float64

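Here groupby() groups by two columns, the first and last names. To "see" anything, you need to "do" something with the result. Let's convert all IDs of each group into a list: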

>>> df.groupby(['FName', 'LName']).id.apply(list)

FName        LName   
Andreas      Kai                         [2006.0]
Constantine  Pape                      [nan, nan]
Koethe       Talukdar    [2004.0, 2004.0, 2005.0]
Manual       Hausman             [2005.0, 2005.0]
Max          Weber               [2007.0, 2007.0]
Nadia        Alam                [2002.0, 2002.0]
Pia          Naime       [2003.0, 2003.0, 2003.0]
Plank        Ingo                        [2009.0]
Tanvir       Hossain             [2001.0, 2001.0]
Weber        Mac         [2008.0, 2008.0, 2008.0]
Name: id, dtype: object

Because the id column contains NaN, the data type is float.

We only want the last ID of each group. So, instead of the list() function, we use a lambda:

lambda x: list(x)[-1]
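To see this first step in isolation, here is a minimal sketch on a hypothetical three-row frame (the name demo and its contents are made up for illustration), where one name appears twice with different ids:

```python
import pandas as pd

# Hypothetical frame: "Koethe Talukdar" appears twice with different ids.
demo = pd.DataFrame({
    "FName": ["Koethe", "Koethe", "Pia"],
    "LName": ["Talukdar", "Talukdar", "Naime"],
    "id": [2004, 2005, 2003],
})

# For each (FName, LName) group, keep the last id in row order.
ids = demo.groupby(["FName", "LName"])["id"].apply(lambda x: list(x)[-1])

print(ids["Koethe", "Talukdar"])  # last of the ids 2004, 2005
print(ids["Pia", "Naime"])
```

The result is a Series with a two-level index, so a (first name, last name) pair selects a single id.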

In the second step, we use ids:

df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName]), axis=1)

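We apply the function to the dataframe row by row (axis=1). Here x is one row. We use the values in its UsedFName and UsedLName columns to look up the corresponding ID, and assign the result to the target column with df.Usedid =.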
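The row-wise lookup can be tried end to end on a small frame. This is only a sketch; demo and its contents are hypothetical, and it assumes every used name has a non-missing id.

```python
import pandas as pd

# Hypothetical frame with both the name columns and the Used* columns.
demo = pd.DataFrame({
    "FName": ["Koethe", "Koethe", "Pia"],
    "LName": ["Talukdar", "Talukdar", "Naime"],
    "id": [2004, 2005, 2003],
    "UsedFName": ["Pia", "Koethe", "Koethe"],
    "UsedLName": ["Naime", "Talukdar", "Talukdar"],
})

# Step 1: last id per (FName, LName) pair.
ids = demo.groupby(["FName", "LName"])["id"].apply(lambda x: list(x)[-1])

# Step 2: x is one row; the pair (x.UsedFName, x.UsedLName)
# selects one entry from the two-level index of ids.
demo["Usedid"] = demo.apply(lambda x: int(ids[x.UsedFName, x.UsedLName]), axis=1)
print(demo["Usedid"].tolist())
```

Note that int() raises a ValueError on NaN, and indexing ids raises a KeyError for a pair it does not contain, so real data with missing ids or unmatched names would need extra handling.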

Output

df looks like this:

          FName     LName    id UsedFName UsedLName  Usedid
0        Tanvir   Hossain  2001    Tanvir   Hossain    2001
1         Nadia      Alam  2002    Tanvir   Hossain    2001
2           Pia     Naime  2003    Tanvir   Hossain    2001
3        Koethe  Talukdar  2004    Koethe  Talukdar    2005
4        Manual   Hausman  2005    Koethe  Talukdar    2005
5   Constantine      Pape   NaN       Max     Weber    2007
6       Andreas       Kai  2006       Max     Weber    2007
7           Max     Weber  2007    Manual   Hausman    2005
8         Weber       Mac  2008    Manual   Hausman    2005
9         Plank      Ingo  2009    Manual   Hausman    2005
10       Tanvir   Hossain  2001       Pia     Naime    2003
11        Weber       Mac  2008       Pia     Naime    2003
12       Manual   Hausman  2005    Tanvir   Hossain    2001
13          Max     Weber  2007    Tanvir   Hossain    2001
14        Nadia      Alam  2002    Manual   Hausman    2005
15        Weber       Mac  2008    Manual   Hausman    2005
16          Pia     Naime  2003    Koethe  Talukdar    2005
17          Pia     Naime  2003    Koethe  Talukdar    2005
18  Constantine      Pape   NaN    Koethe  Talukdar    2005
19       Koethe  Talukdar  2004    Koethe  Talukdar    2005
20       Koethe  Talukdar  2005    Manual   Hausman    2005
21          NaN       NaN   NaN    Manual   Hausman    2005
22          NaN       NaN   NaN    Manual   Hausman    2005
23          NaN       NaN   NaN    Manual   Hausman    2005
24          NaN       NaN   NaN    Manual   Hausman    2005
25          NaN       NaN   NaN    Manual   Hausman    2005
26          NaN       NaN   NaN    Manual   Hausman    2005
27          NaN       NaN   NaN    Manual   Hausman    2005
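As a further option that is not part of either answer, the same lookup can be sketched with a left merge. The frame below is hypothetical and leaves out the Usedid column; on the question's df that column would have to be dropped first (an assumption) so the merge does not produce suffixed duplicates.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a repeated name with two ids, plus a row with missing names.
df = pd.DataFrame({
    "FName": ["Tanvir", "Koethe", "Koethe", np.nan],
    "LName": ["Hossain", "Talukdar", "Talukdar", np.nan],
    "id": [2001, 2004, 2005, np.nan],
    "UsedFName": ["Koethe", "Tanvir", "Tanvir", "Koethe"],
    "UsedLName": ["Talukdar", "Hossain", "Hossain", "Talukdar"],
})

# One row per (FName, LName) pair, keeping the last id, with the key
# columns renamed so they line up with the Used* columns.
lookup = (
    df[["FName", "LName", "id"]]
    .dropna(subset=["FName", "LName"])
    .drop_duplicates(subset=["FName", "LName"], keep="last")
    .rename(columns={"FName": "UsedFName", "LName": "UsedLName", "id": "Usedid"})
)

# A left merge keeps every row of df; unmatched pairs get NaN in Usedid.
result = df.merge(lookup, on=["UsedFName", "UsedLName"], how="left")
print(result["Usedid"].tolist())
```

Keeping the last duplicate mirrors the "last id wins" behaviour of both answers, and because the lookup keys are unique the merge cannot multiply rows.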