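I have a DataFrame in which customer names are stored in two columns. I need to drop the words that the two columns have in common and return the words that do not match.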
from io import StringIO
import pandas as pd
u_cols = ['page_id','web_id']
audit_trail = StringIO('''
shantanu prabhakar oak | santanu prabhakar oak
amar atmaram patil | amar atmaram patel
''')
df11 = pd.read_csv(audit_trail, sep="|", names = u_cols )
Expected result:
santanu
patel
I tried this, but it compares whole strings rather than individual words:
set(df11['page_id']) - set(df11['web_id'])
{'amar atmaram patil ', 'shantanu prabhakar oak '}
Update
It would be great if this returned a dictionary of corrections:
{'shantanu':'santanu','patil':'patel'}
I did not ask for this earlier because I thought it was not possible in pandas.
Answer 0 (score: 2)
Use apply with axis=1 and take a per-row set difference of the split words:
In [5128]: df
Out[5128]:
page_id web_id
0 shantanu prabhakar oak santanu prabhakar oak
1 amar atmaram patil amar atmaram patel
In [5129]: df.apply(lambda x: set(x.web_id.split()) - set(x.page_id.split()), axis=1)
Out[5129]:
0 {santanu}
1 {patel}
dtype: object
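For readers who want to run this outside an IPython session, here is a minimal self-contained sketch of the same set-difference idea, using the data from the question. The helper name words_only_in_web_id is made up for illustration. Note that split() with no arguments also discards the stray spaces left around the "|" separator.

```python
from io import StringIO

import pandas as pd

# Same two rows as in the question.
raw = StringIO(
    "shantanu prabhakar oak | santanu prabhakar oak\n"
    "amar atmaram patil | amar atmaram patel\n"
)
df = pd.read_csv(raw, sep="|", names=["page_id", "web_id"])


def words_only_in_web_id(row):
    """Hypothetical helper: words present in web_id but absent from page_id."""
    return set(row["web_id"].split()) - set(row["page_id"].split())


diff_words = df.apply(words_only_in_web_id, axis=1)
print(diff_words.tolist())  # [{'santanu'}, {'patel'}]
```

Because sets are unordered, a row with several differing words would lose the original word order.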
Update
In [5134]: df.apply(lambda x: {b:a for a, b in zip(x.web_id.split(), x.page_id.split())
if a!=b}, axis=1)
Out[5134]:
0 {u'shantanu': u'santanu'}
1 {u'patil': u'patel'}
dtype: object
Or as a flat dictionary:
In [5141]: vals = df.apply(lambda x: {b:a for a, b in zip(x.web_id.split(),
x.page_id.split())
if a!=b}, axis=1)
In [5142]: {k:v for d in vals.values for k, v in d.items()}
Out[5142]: {'patil': 'patel', 'shantanu': 'santanu'}
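The zip-based pairing above relies on both names having the same number of words in the same order. If that cannot be guaranteed, one possible alternative (not part of the answer above) is to pair the leftover words by similarity with the standard library's difflib.get_close_matches. This is only a sketch: the function name corrections, the 0.6 cutoff, and the reordered second name are assumptions chosen for illustration.

```python
from difflib import get_close_matches


def corrections(page_name, web_name, cutoff=0.6):
    """Hypothetical helper: map each unmatched page_id word to its closest web_id word."""
    page_words = page_name.split()
    web_words = web_name.split()
    # Keep only the words that do not appear on the other side.
    only_page = [w for w in page_words if w not in web_words]
    only_web = [w for w in web_words if w not in page_words]
    result = {}
    for word in only_page:
        match = get_close_matches(word, only_web, n=1, cutoff=cutoff)
        if match:  # skip words that have no sufficiently similar counterpart
            result[word] = match[0]
    return result


pairs = [
    ("shantanu prabhakar oak", "santanu prabhakar oak"),
    # Word order deliberately changed (hypothetical input) to show that
    # positional pairing is not required.
    ("amar atmaram patil", "amar patel atmaram"),
]

flat = {}
for page_name, web_name in pairs:
    flat.update(corrections(page_name, web_name))
print(flat)  # {'shantanu': 'santanu', 'patil': 'patel'}
```

The cutoff trades recall for precision: a lower value pairs more words but may match unrelated ones.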
Answer 1 (score: 2)
Use pd.DataFrame.applymap and pd.DataFrame.diff
df11.applymap(lambda x: set(x.split())).diff(axis=1).iloc[:, -1]
0 {santanu}
1 {patel}
Name: web_id, dtype: object
Alternatively, build a space-separated string:
df11.applymap(lambda x: set(x.split())).diff(axis=1).iloc[:, -1].apply(' '.join)
0 santanu
1 patel
Name: web_id, dtype: object
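DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so on recent versions the one-liner above may emit a warning. A minimal sketch that reaches the same result without applymap or diff is shown below. It builds the word sets with plain Python and subtracts them row by row. The variable names are illustrative, and sorting the words before joining is an added assumption to make the output deterministic.

```python
from io import StringIO

import pandas as pd

# Same two rows as in the question.
raw = StringIO(
    "shantanu prabhakar oak | santanu prabhakar oak\n"
    "amar atmaram patil | amar atmaram patel\n"
)
df11 = pd.read_csv(raw, sep="|", names=["page_id", "web_id"])

# One set of words per row for each column.
page_sets = [set(name.split()) for name in df11["page_id"]]
web_sets = [set(name.split()) for name in df11["web_id"]]

# Words that appear in web_id but not in page_id, row by row.
extra = pd.Series(
    [web - page for page, web in zip(page_sets, web_sets)],
    index=df11.index,
    name="web_id",
)

# Join each set into a space-separated string (sorted for a stable order).
joined = extra.map(lambda words: " ".join(sorted(words)))
print(joined.tolist())  # ['santanu', 'patel']
```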