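I have two DataFrames. I look for matches between them on the tld column, and where a match is found (between the column in source and the column in destination), I copy the uuid value from source into the destination DataFrame.
Now I also need to check whether another column (company_name) matches.
DataFrame 1: source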
   uuid           website  company_name           tld
0    11  www.facebook.com      facebook  facebook.com
1    22     www.yahoo.com     yahoo inc     yahoo.com
2    33    www.google.com        Google    google.com
3    44     www.cisco.com         Cisco     cisco.com
DataFrame 2: destination
   id           website  company_name           tld  match  uuid
0   a  www.facebook.com      facebook  facebook.com  False   NaN
1   b         www.y.com     Yahoo Inc         y.com  False   NaN
2   c         www.g.com        Google         g.com  False   NaN
3   d         www.g.com    Google Inc         g.com  False   NaN
4   e  www.facebook.com  Facebook Inc  facebook.com  False   NaN
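For anyone wanting to reproduce this, the two frames could be rebuilt roughly as follows. This is a minimal sketch: the values are copied from the printed tables, and the uuid column is deliberately left out of destination on the assumption that it is only created by the merge (a pre-existing uuid column would collide with the merged one).

```python
import pandas as pd

# Source frame: the rows that carry the uuid values to be copied.
source = pd.DataFrame({
    "uuid": [11, 22, 33, 44],
    "website": ["www.facebook.com", "www.yahoo.com", "www.google.com", "www.cisco.com"],
    "company_name": ["facebook", "yahoo inc", "Google", "Cisco"],
    "tld": ["facebook.com", "yahoo.com", "google.com", "cisco.com"],
})

# Destination frame: match starts as False for every row.
# Assumption: no uuid column yet, since the merge is what adds it.
destination = pd.DataFrame({
    "id": list("abcde"),
    "website": ["www.facebook.com", "www.y.com", "www.g.com", "www.g.com", "www.facebook.com"],
    "company_name": ["facebook", "Yahoo Inc", "Google", "Google Inc", "Facebook Inc"],
    "tld": ["facebook.com", "y.com", "g.com", "g.com", "facebook.com"],
    "match": False,
})
```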
Finding matches:
destination.loc[destination.tld.isin(source.tld),'match'] = True
destination = destination.merge(source[['tld', 'uuid']], on='tld', how='left')
The code above copies the uuid column from source into the uuid column of the destination DataFrame:
   id           website  company_name           tld  match  uuid
0   a  www.facebook.com      facebook  facebook.com   True    11
1   b         www.y.com     Yahoo Inc         y.com  False   NaN
2   c         www.g.com        Google         g.com  False   NaN
3   d         www.g.com    Google Inc         g.com  False   NaN
4   e  www.facebook.com  Facebook Inc  facebook.com   True    11
Now I also need to check whether company_name matches, to get a result like this:
   id           website  company_name           tld  match  uuid
0   a  www.facebook.com      facebook  facebook.com   True    11
1   b         www.y.com     Yahoo Inc         y.com  False   NaN
2   c         www.g.com        Google         g.com   True    33
3   d         www.g.com    Google Inc         g.com  False   NaN
4   e  www.facebook.com  Facebook Inc  facebook.com   True    11
When I try adding:
destination.loc[destination.company_name.isin(source.company_name), 'match'] = True
destination = destination.merge(source[['company_name', 'uuid']], on='company_name', how='left')
I get duplicated uuid columns, uuid_x and uuid_y:
   id           website  company_name           tld  match  uuid_x  uuid_y
0   a  www.facebook.com      facebook  facebook.com   True      11      11
1   b         www.y.com     Yahoo Inc         y.com  False     NaN     NaN
2   c         www.g.com        Google         g.com   True     NaN      33
3   d         www.g.com    Google Inc         g.com  False     NaN     NaN
4   e  www.facebook.com  Facebook Inc  facebook.com   True      11     NaN
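The suffixes come from pandas itself: when both frames passed to merge contain a non-key column with the same name, the overlapping columns are renamed with the default suffixes _x (left frame) and _y (right frame). A small sketch of that behaviour, using hypothetical two-row frames named left and right rather than the real data:

```python
import pandas as pd

# Hypothetical frames for illustration: both carry a non-key column named uuid.
left = pd.DataFrame({"company_name": ["facebook", "Google"],
                     "uuid": [11.0, float("nan")]})
right = pd.DataFrame({"company_name": ["facebook", "Google"],
                      "uuid": [11, 33]})

# uuid is not a join key, so merge keeps both copies and renames them.
merged = left.merge(right, on="company_name", how="left")
print(list(merged.columns))  # ['company_name', 'uuid_x', 'uuid_y']
```

Here uuid_x holds the values that were already in the left frame and uuid_y the values looked up from the right frame.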
Final code:
destination.loc[destination.tld.isin(source.tld),'match'] = True
destination = destination.merge(source[['tld', 'uuid']], on='tld', how='left')
destination.loc[destination.company_name.isin(source.company_name), 'match'] = True
destination = destination.merge(source[['company_name', 'uuid']], on='company_name', how='left')
Answer 0 (score: 1)
I think you need to chain the boolean masks m1 and m2 for the match column, and use combine_first to build the new column of matched values:
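# Boolean masks: does each destination row match source on tld / on company_name?
m1 = destination.tld.isin(source.tld)
m2 = destination.company_name.isin(source.company_name)
# A row counts as matched if either column matches
destination['match'] = m1 | m2
# Look up uuid by tld and by company_name separately
# (assumes destination has no uuid column yet; otherwise merge adds _x/_y suffixes)
destination1 = destination.merge(source[['tld', 'uuid']], on='tld', how='left')
destination = destination.merge(source[['company_name', 'uuid']], on='company_name', how='left')
# Prefer the company_name lookup and fall back to the tld lookup where it is NaN
destination['uuid'] = destination['uuid'].combine_first(destination1['uuid'])
print(destination)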
   id           website  company_name           tld  match  uuid
0   a  www.facebook.com      facebook  facebook.com   True  11.0
1   b         www.y.com     Yahoo Inc         y.com  False   NaN
2   c         www.g.com        Google         g.com   True  33.0
3   d         www.g.com    Google Inc         g.com  False   NaN
4   e  www.facebook.com  Facebook Inc  facebook.com   True  11.0
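As a possible alternative (not part of the answer above), the same lookup can be sketched with Series.map, which never adds columns and so sidesteps the suffix issue. The frame construction is copied from the question; the names by_name and by_tld are made up for illustration, and the sketch assumes that company_name and tld are unique in source and that every source row has a uuid.

```python
import pandas as pd

source = pd.DataFrame({
    "uuid": [11, 22, 33, 44],
    "company_name": ["facebook", "yahoo inc", "Google", "Cisco"],
    "tld": ["facebook.com", "yahoo.com", "google.com", "cisco.com"],
})
destination = pd.DataFrame({
    "id": list("abcde"),
    "company_name": ["facebook", "Yahoo Inc", "Google", "Google Inc", "Facebook Inc"],
    "tld": ["facebook.com", "y.com", "g.com", "g.com", "facebook.com"],
})

# Lookup Series keyed by each matching column (keys assumed unique in source).
by_name = source.set_index("company_name")["uuid"]
by_tld = source.set_index("tld")["uuid"]

# map returns NaN where a destination value has no key in the lookup.
uuid_from_name = destination["company_name"].map(by_name)
uuid_from_tld = destination["tld"].map(by_tld)

# Prefer the company_name hit, fall back to the tld hit.
destination["uuid"] = uuid_from_name.combine_first(uuid_from_tld)
# With no missing uuid in source, a non-null uuid means at least one column matched.
destination["match"] = destination["uuid"].notna()
print(destination)
```

Because map keeps the original index and row count, there is no risk of rows being duplicated the way a left merge can duplicate them when the key is repeated in source; with repeated keys, map would raise an error instead, which is why the uniqueness assumption matters.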