Pandas: multiple merges produce _x and _y columns

Time: 2018-07-08 04:24:41

Tags: python pandas dataframe

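I have two DataFrames. I look for matches between them on one column (tld), and when a match is found between the source and destination columns, I copy the value of the uuid column from the source to the destination DataFrame.

Now I also need to check whether another column matches (company_name).

DataFrame 1: source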

   uuid           website company_name           tld
0     11  www.facebook.com     facebook  facebook.com
1     22     www.yahoo.com    yahoo inc     yahoo.com
2     33    www.google.com       Google    google.com
3     44     www.cisco.com        Cisco     cisco.com

DataFrame 2: destination

  id           website  company_name           tld  match  uuid
0  a  www.facebook.com      facebook  facebook.com  False   NaN
1  b         www.y.com     Yahoo Inc         y.com  False   NaN
2  c         www.g.com        Google         g.com  False   NaN
3  d         www.g.com    Google Inc         g.com  False   NaN
4  e  www.facebook.com  Facebook Inc  facebook.com  False   NaN

Finding matches

destination.loc[destination.tld.isin(source.tld),'match'] = True
destination = destination.merge(source[['tld', 'uuid']], on='tld', how='left')

The code above copies the uuid column from the source into the uuid column of the destination DataFrame.

  id           website  company_name           tld  match  uuid
0  a  www.facebook.com      facebook  facebook.com   True    11
1  b         www.y.com     Yahoo Inc         y.com  False   NaN
2  c         www.g.com        Google         g.com  False   NaN
3  d         www.g.com    Google Inc         g.com  False   NaN
4  e  www.facebook.com  Facebook Inc  facebook.com   True    11
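
A minimal sketch that reproduces this first step with the sample data rebuilt by hand. It assumes destination has no uuid column before the merge; the listing of DataFrame 2 shows one filled with NaN, and if it were really present this first merge would already yield uuid_x and uuid_y.

```python
import pandas as pd

# Sample data rebuilt from the tables in the question.
source = pd.DataFrame({
    "uuid": [11, 22, 33, 44],
    "website": ["www.facebook.com", "www.yahoo.com", "www.google.com", "www.cisco.com"],
    "company_name": ["facebook", "yahoo inc", "Google", "Cisco"],
    "tld": ["facebook.com", "yahoo.com", "google.com", "cisco.com"],
})

# Assumption: no uuid column yet, only the match flag.
destination = pd.DataFrame({
    "id": list("abcde"),
    "website": ["www.facebook.com", "www.y.com", "www.g.com", "www.g.com", "www.facebook.com"],
    "company_name": ["facebook", "Yahoo Inc", "Google", "Google Inc", "Facebook Inc"],
    "tld": ["facebook.com", "y.com", "g.com", "g.com", "facebook.com"],
    "match": False,
})

# Flag rows whose tld also occurs in source.
destination.loc[destination["tld"].isin(source["tld"]), "match"] = True

# A left merge keeps every destination row and pulls in uuid where tld matches.
destination = destination.merge(source[["tld", "uuid"]], on="tld", how="left")
print(destination)
```

The uuid column comes back as float because unmatched rows hold NaN.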

Now I also need to check whether company_name matches, to get something like this:

  id           website  company_name           tld  match  uuid
0  a  www.facebook.com      facebook  facebook.com   True    11
1  b         www.y.com     Yahoo Inc         y.com  False   NaN
2  c         www.g.com        Google         g.com   True    33
3  d         www.g.com    Google Inc         g.com  False   NaN
4  e  www.facebook.com  Facebook Inc  facebook.com   True    11

When I try adding:

destination.loc[destination.company_name.isin(source.company_name), 'match'] = True
destination = destination.merge(source[['company_name', 'uuid']], on='company_name', how='left')

I get duplicated uuid columns: uuid_x and uuid_y

  id           website  company_name           tld  match uuid_x uuid_y
0  a  www.facebook.com      facebook  facebook.com   True     11     11
1  b         www.y.com     Yahoo Inc         y.com  False    NaN    NaN
2  c         www.g.com        Google         g.com   True    NaN     33
3  d         www.g.com    Google Inc         g.com  False    NaN    NaN
4  e  www.facebook.com  Facebook Inc  facebook.com   True     11    NaN
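
The duplicated columns come from how merge handles name clashes: when both frames carry a non-key column with the same name, pandas keeps both and appends the default suffixes `_x` (left) and `_y` (right). A small sketch of that behaviour, using two hypothetical frames `left` and `right` that stand in for destination after the first merge and for `source[['company_name', 'uuid']]`:

```python
import pandas as pd

# left already has a uuid column (as destination does after the first merge).
left = pd.DataFrame({"company_name": ["facebook", "Google"], "uuid": [11.0, None]})
right = pd.DataFrame({"company_name": ["facebook", "Google"], "uuid": [11, 33]})

# Both sides have a non-key column named uuid, so merge disambiguates them.
merged = left.merge(right, on="company_name", how="left")
print(merged.columns.tolist())  # ['company_name', 'uuid_x', 'uuid_y']

# The suffixes parameter only renames the two columns; it does not combine them.
named = left.merge(right, on="company_name", how="left", suffixes=("_tld", "_name"))
print(named.columns.tolist())  # ['company_name', 'uuid_tld', 'uuid_name']
```

Either way the two lookups still have to be coalesced into one column afterwards.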

Final code

destination.loc[destination.tld.isin(source.tld),'match'] = True
destination = destination.merge(source[['tld', 'uuid']], on='tld', how='left')
destination.loc[destination.company_name.isin(source.company_name), 'match'] = True
destination = destination.merge(source[['company_name', 'uuid']], on='company_name', how='left')

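1 Answer:

Answer 0: (score: 1)

I think you need to chain the boolean masks m1 and m2 for the match column, and use combine_first to build the new column of matched values: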

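# Assumes destination has no uuid column yet; otherwise each merge below
# would again produce uuid_x and uuid_y.
m1 = destination.tld.isin(source.tld)
m2 = destination.company_name.isin(source.company_name)
destination['match'] = m1 | m2
# uuid looked up by tld
destination1 = destination.merge(source[['tld', 'uuid']], on='tld', how='left')
# uuid looked up by company_name
destination = destination.merge(source[['company_name', 'uuid']], on='company_name', how='left')

# Keep the company_name lookup and fill its gaps from the tld lookup
destination['uuid'] = destination['uuid'].combine_first(destination1['uuid'])
print(destination)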
  id           website  company_name           tld  match  uuid
0  a  www.facebook.com      facebook  facebook.com   True  11.0
1  b         www.y.com     Yahoo Inc         y.com  False   NaN
2  c         www.g.com        Google         g.com   True  33.0
3  d         www.g.com    Google Inc         g.com  False   NaN
4  e  www.facebook.com  Facebook Inc  facebook.com   True  11.0
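
For reference, a self-contained sketch of the same approach with the sample data rebuilt by hand. The names `by_tld` and `by_name` are illustrative and not part of the answer, and destination is again assumed to start without a uuid column. It also relies on the keys in source being unique: each left merge then returns exactly one row per destination row, in the same order, so combine_first can align the two lookups by index.

```python
import pandas as pd

source = pd.DataFrame({
    "uuid": [11, 22, 33, 44],
    "company_name": ["facebook", "yahoo inc", "Google", "Cisco"],
    "tld": ["facebook.com", "yahoo.com", "google.com", "cisco.com"],
})

# Assumption: destination starts without a uuid column.
destination = pd.DataFrame({
    "id": list("abcde"),
    "website": ["www.facebook.com", "www.y.com", "www.g.com", "www.g.com", "www.facebook.com"],
    "company_name": ["facebook", "Yahoo Inc", "Google", "Google Inc", "Facebook Inc"],
    "tld": ["facebook.com", "y.com", "g.com", "g.com", "facebook.com"],
})

# A row matches if either its tld or its company_name occurs in source.
m1 = destination["tld"].isin(source["tld"])
m2 = destination["company_name"].isin(source["company_name"])
destination["match"] = m1 | m2

# Two independent lookups, each from a destination that has no uuid column,
# so neither merge needs suffixes.
by_tld = destination.merge(source[["tld", "uuid"]], on="tld", how="left")
by_name = destination.merge(source[["company_name", "uuid"]], on="company_name", how="left")

# Prefer the company_name lookup and fall back to the tld lookup.
destination["uuid"] = by_name["uuid"].combine_first(by_tld["uuid"])
print(destination)
```

Note that the company_name comparison is case-sensitive, which is why "Yahoo Inc" does not match "yahoo inc" here.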