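Return non-matching words from dataframe columns

Date: 2017-10-22 06:50:41

Tags: pandas

I have a dataframe in which customer names are stored in 2 columns. I need to drop the words that are common to both columns and return the words that do not match.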

from io import StringIO

import pandas as pd

u_cols = ['page_id','web_id']
audit_trail = StringIO('''
shantanu prabhakar oak | santanu prabhakar oak
amar atmaram patil | amar atmaram patel 
''')

df11 = pd.read_csv(audit_trail, sep="|", names = u_cols  )

Expected result:

santanu
patel

I tried:

set(df11['page_id']) - set(df11['web_id'])

{'amar atmaram patil ', 'shantanu prabhakar oak '}

Update

It would be great if this could return a dictionary of the corrections:

{'shantanu':'santanu','patil':'patel'}

I did not ask for this earlier because I thought it was not possible in pandas.

2 Answers:

Answer 0 (score: 2)

Use a row-wise set difference on the split words:

In [5128]: df
Out[5128]:
                  page_id                 web_id
0  shantanu prabhakar oak  santanu prabhakar oak
1      amar atmaram patil     amar atmaram patel

In [5129]: df.apply(lambda x: set(x.web_id.split()) - set(x.page_id.split()), axis=1)
Out[5129]:
0    {santanu}
1      {patel}
dtype: object
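
A set difference discards word order and repeated words. If that matters, an order-preserving variant can be sketched in plain Python as below. The helper name `unmatched_words` and the `rows` list are hypothetical and only serve the illustration; they are not part of the answer above.

```python
def unmatched_words(reference, candidate):
    """Return the words of `candidate` that do not occur in `reference`, in order."""
    seen = set(reference.split())  # words to be treated as "common"
    return [word for word in candidate.split() if word not in seen]


# Sample rows copied from the question: (page_id, web_id)
rows = [
    ("shantanu prabhakar oak", "santanu prabhakar oak"),
    ("amar atmaram patil", "amar atmaram patel"),
]

# One list of unmatched words per row
result = [unmatched_words(page, web) for page, web in rows]
print(result)
```

On the dataframe, the same helper could presumably be applied with `df.apply(lambda x: unmatched_words(x.page_id, x.web_id), axis=1)`.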

Update

In [5134]: df.apply(lambda x: {b:a for a, b in zip(x.web_id.split(), x.page_id.split()) 
                               if a!=b}, axis=1)
Out[5134]:
0    {u'shantanu': u'santanu'}
1         {u'patil': u'patel'}
dtype: object

Or as a flat dictionary

In [5141]: vals = df.apply(lambda x: {b:a for a, b in zip(x.web_id.split(),
                                                          x.page_id.split())
                                      if a!=b}, axis=1)

In [5142]: {k:v for d in vals.values for k, v in d.items()}
Out[5142]: {'patil': 'patel', 'shantanu': 'santanu'}
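
Note that the `zip`-based pairing compares words position by position, so it assumes both names have the same number of words in the same order. When one name has an extra or missing word, the pairs shift and unrelated words get mapped to each other. A minimal sketch of a more defensive pairing, using the standard library's `difflib.SequenceMatcher` instead of `zip`, is shown below. The function name `corrections` is hypothetical, and skipping replaced runs of unequal length is an assumption made for this illustration.

```python
from difflib import SequenceMatcher


def corrections(page_words, web_words):
    """Map each replaced word in page_words to its counterpart in web_words."""
    fixes = {}
    matcher = SequenceMatcher(None, page_words, web_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only pair words when a run was replaced by a run of the same length;
        # insertions, deletions and uneven replacements are left out.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            fixes.update(zip(page_words[i1:i2], web_words[j1:j2]))
    return fixes


same_length = corrections("shantanu prabhakar oak".split(),
                          "santanu prabhakar oak".split())
extra_word = corrections("amar patil".split(),
                         "amar atmaram patil".split())
print(same_length, extra_word)
```

In the second call a plain `zip` would pair `patil` with `atmaram`; the alignment treats `atmaram` as an insertion and reports no correction.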

Answer 1 (score: 2)

Use pd.DataFrame.applymap and pd.DataFrame.diff

df11.applymap(lambda x: set(x.split())).diff(axis=1).iloc[:, -1]

0    {santanu}
1      {patel}
Name: web_id, dtype: object

Alternatively, create a space-separated string

df11.applymap(lambda x: set(x.split())).diff(axis=1).iloc[:, -1].apply(' '.join)

0    santanu
1      patel
Name: web_id, dtype: object