Concatenate two DataFrames and drop duplicate rows based on a column value

Time: 2020-01-05 15:50:11

Tags: python pandas

I have two DataFrames.

df1

    Name Symbol         ID
0    Jay    N/A    372Y105
1    Ray    N/A    4446100
2   Faye    N/A    484MAA4
3   Maye    N/A    504W308
4    Kay    N/A    782L107
5   Trey    FFF    782L111

df2

    Name Symbol         ID
0    Jay    AAA    372Y105
1   Faye    CCC    484MAA4
2    Kay    EEE    782L107

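If an ID matches between df1 and df2, I want to replace the Symbol in df1 with the Symbol from df2, so that the output looks like this:

    Name Symbol         ID
0    Jay    AAA    372Y105
1    Ray    N/A    4446100
2   Faye    CCC    484MAA4
3   Maye    N/A    504W308
4    Kay    EEE    782L107
5   Trey    FFF    782L111

It sounds like I should first concatenate the two DataFrames and then somehow drop the duplicates, for example:

    df3 = pd.concat([df1, df2])
    df3 = df3.drop_duplicates(subset='ID', keep='last')

However, instead of just keeping the first or last of the duplicates, I want to drop the duplicates where Symbol is N/A.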

1 answer:

Answer 0 (score: 1)

First use merge with a left join, then fill the missing values in the Symbol column with the values from the Symbol_ column:

print (df1.merge(df2, on=['Name','ID'], how='left', suffixes=('', '_')))
   Name Symbol       ID Symbol_
0   Jay    NaN  372Y105     AAA
1   Ray    NaN  4446100     NaN
2  Faye    NaN  484MAA4     CCC
3  Maye    NaN  504W308     NaN
4   Kay    NaN  782L107     EEE
5  Trey    FFF  782L111     NaN

df = (df1.merge(df2, on=['Name','ID'], how='left', suffixes=('', '_'))
         .assign(Symbol = lambda x: x['Symbol'].fillna(x.pop('Symbol_'))))
print (df)
   Name Symbol       ID
0   Jay    AAA  372Y105
1   Ray    NaN  4446100
2  Faye    CCC  484MAA4
3  Maye    NaN  504W308
4   Kay    EEE  782L107
5  Trey    FFF  782L111
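
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same merge-and-fill idea. It assumes the N/A entries in the question are real missing values (NaN), which is what the printed output above suggests; the names `merged` and `result` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question; "N/A" is modelled as NaN (an assumption).
df1 = pd.DataFrame({
    "Name": ["Jay", "Ray", "Faye", "Maye", "Kay", "Trey"],
    "Symbol": [np.nan, np.nan, np.nan, np.nan, np.nan, "FFF"],
    "ID": ["372Y105", "4446100", "484MAA4", "504W308", "782L107", "782L111"],
})
df2 = pd.DataFrame({
    "Name": ["Jay", "Faye", "Kay"],
    "Symbol": ["AAA", "CCC", "EEE"],
    "ID": ["372Y105", "484MAA4", "782L107"],
})

# Left join keeps every row of df1; df2's Symbol arrives as "Symbol_".
merged = df1.merge(df2, on=["Name", "ID"], how="left", suffixes=("", "_"))

# Fill gaps in df1's Symbol from df2's Symbol, then drop the helper column.
merged["Symbol"] = merged["Symbol"].fillna(merged["Symbol_"])
result = merged.drop(columns="Symbol_")
print(result)
```

Note that fillna only fills rows where df1's Symbol is missing, so a non-missing Symbol in df1 is kept even if df2 has a different one for the same Name and ID.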

Another solution using DataFrame.update:

df1 = df1.set_index(['Name','ID'])
df2 = df2.set_index(['Name','ID'])
df1.update(df2)
df1 = df1.reset_index()
print (df1)
   Name       ID Symbol
0   Jay  372Y105    AAA
1   Ray  4446100    NaN
2  Faye  484MAA4    CCC
3  Maye  504W308    NaN
4   Kay  782L107    EEE
5  Trey  782L111    FFF
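
Both solutions above treat N/A as a missing value and match on Name and ID together. If the column instead holds the literal string "N/A", or only ID should decide the match as the question describes, a lookup with Series.map may be simpler. This is only a sketch under those assumptions: it requires the IDs in df2 to be unique, and `lookup` and `result` are illustrative names.

```python
import pandas as pd

# Sample data from the question, with "N/A" kept as a literal string (an assumption).
df1 = pd.DataFrame({
    "Name": ["Jay", "Ray", "Faye", "Maye", "Kay", "Trey"],
    "Symbol": ["N/A", "N/A", "N/A", "N/A", "N/A", "FFF"],
    "ID": ["372Y105", "4446100", "484MAA4", "504W308", "782L107", "782L111"],
})
df2 = pd.DataFrame({
    "Name": ["Jay", "Faye", "Kay"],
    "Symbol": ["AAA", "CCC", "EEE"],
    "ID": ["372Y105", "484MAA4", "782L107"],
})

# ID -> Symbol lookup built from df2 (IDs in df2 must be unique for map to work).
lookup = df2.set_index("ID")["Symbol"]

# Take df2's Symbol where the ID matches; otherwise keep df1's original Symbol.
result = df1.copy()
result["Symbol"] = df1["ID"].map(lookup).fillna(df1["Symbol"])
print(result)
```

Unlike the fillna approach, this overwrites df1's Symbol whenever the ID is found in df2, regardless of what df1 held before.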