How to remove words in a pandas DataFrame column that match words in another column

Time: 2019-10-22 09:20:25

Tags: python pandas replace

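I am trying to remove the parts of a string in one pandas DataFrame column that also appear in another column. The values are comma-separated, and there may be one or several of them. I want to create a new column holding what is left of the string. Below is a reproducible example and the code I have so far: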

import pandas as pd

df = pd.DataFrame({
    'Country' : ['Germany, France, Brazil, India, Russia',
                 'Russia, France, Jamaica, India, China',
                 'Germany, Russia, Jamaica','Italy, Jamaica'],
    'Exclude' : ['France, Brazil','India, Russia','Jamaica','Italy']})

print(df)

The printed DataFrame:

                                  Country         Exclude
0  Germany, France, Brazil, India, Russia  France, Brazil
1   Russia, France, Jamaica, India, China   India, Russia
2                Germany, Russia, Jamaica         Jamaica
3                          Italy, Jamaica           Italy

I want to create an 'Output' column containing the names of the countries that are not present in the 'Exclude' column. So I tried:

df['Output'] = df['Country'].replace(to_replace=r'\b'+df['Exclude']+r'\b', 
value='',regex=True)

Desired output:

                                  Country         Exclude                  Output
0  Germany, France, Brazil, India, Russia  France, Brazil  Germany, India, Russia
1   Russia, France, Jamaica, India, China   India, Russia  France, Jamaica, China
2                Germany, Russia, Jamaica         Jamaica         Germany, Russia
3                          Italy, Jamaica           Italy                 Jamaica

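This does half the job: it matches when the text in the 'Exclude' column appears in 'Country' exactly as written, but it fails when the names in 'Country' come in a different sequence than in 'Exclude'. For example, it does not work for the second row. I spent a lot of time and tried several other approaches before posting, and I found similar questions on SO, but they did not help in this case. Please help.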
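The behaviour described above can be reproduced with the standard `re` module alone. The following is only an illustrative sketch on the second row of the example data; the variable names (`whole`, `pattern`) are made up for the demonstration, and it shows why a single whole-string pattern misses names that are not adjacent, not how pandas applies the replacement internally.

```python
import re

# Second row of the example data
country = 'Russia, France, Jamaica, India, China'
exclude = 'India, Russia'

# Using the whole Exclude string as one pattern only matches when the
# names appear next to each other and in the same order.
whole = re.sub(r'\b' + re.escape(exclude) + r'\b', '', country)
print(whole == country)  # nothing was removed

# One alternative per excluded name matches each name wherever it occurs.
names = [re.escape(n) for n in exclude.split(', ')]
pattern = r'\b(?:' + '|'.join(names) + r')\b'
print(re.findall(pattern, country))
```

Removing the matches with such a pattern would still leave stray commas behind, which is one reason to split the strings into lists instead.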

1 Answer:

Answer 0 (score: 2)

Use apply to split the values in each row and take the set difference:

f=lambda x: ', '.join(set(x['Country'].split(', ')).difference(set(x['Exclude'].split(', '))))
df['Out'] = df.apply(f, axis=1)

Or use a list comprehension with zip:

df['Out'] = ([', '.join(set(a.split(', ')).difference(set(b.split(', ')))) 
                  for a, b in zip(df['Country'], df['Exclude'])])

print (df)
                                  Country         Exclude  \
0  Germany, France, Brazil, India, Russia  France, Brazil   
1   Russia, France, Jamaica, India, China   India, Russia   
2                Germany, Russia, Jamaica         Jamaica   
3                          Italy, Jamaica           Italy   

                      Out  
0  Germany, India, Russia  
1  China, France, Jamaica  
2         Germany, Russia  
3                 Jamaica  
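Because a set is unordered, the order of the names in the joined string above is not guaranteed and may differ between runs. If a reproducible result is wanted but the original order is not needed, one option is to sort the difference before joining. This is a minimal sketch on plain lists that mirror the two columns; the function name `set_difference_sorted` is hypothetical.

```python
def set_difference_sorted(country, exclude, sep=', '):
    """Return the names in `country` that are not in `exclude`, sorted alphabetically."""
    remaining = set(country.split(sep)) - set(exclude.split(sep))
    return sep.join(sorted(remaining))

# Plain lists holding the same values as the 'Country' and 'Exclude' columns
countries = ['Germany, France, Brazil, India, Russia',
             'Russia, France, Jamaica, India, China',
             'Germany, Russia, Jamaica',
             'Italy, Jamaica']
excludes = ['France, Brazil', 'India, Russia', 'Jamaica', 'Italy']

out = [set_difference_sorted(a, b) for a, b in zip(countries, excludes)]
print(out)

# With the DataFrame from the question, the same comprehension can be assigned
# to a column by zipping df['Country'] and df['Exclude'] instead of the lists.
```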

If order matters:

df['Out'] = [', '.join(x for x in a.split(', ') if x not in set(b.split(', '))) 
                    for a, b in zip(df['Country'], df['Exclude'])]
print (df)
                                  Country         Exclude  \
0  Germany, France, Brazil, India, Russia  France, Brazil   
1   Russia, France, Jamaica, India, China   India, Russia   
2                Germany, Russia, Jamaica         Jamaica   
3                          Italy, Jamaica           Italy   

                      Out  
0  Germany, India, Russia  
1  France, Jamaica, China  
2         Germany, Russia  
3                 Jamaica
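The order-preserving version can also be wrapped in a small helper. The sketch below is an assumption-laden variant, not part of the original answer: the name `remove_excluded` is hypothetical, it builds the exclusion set once per row instead of once per name, and it splits on commas with any surrounding whitespace so that inputs such as `'India,Russia'` are handled as well.

```python
import re

def remove_excluded(country, exclude):
    """Drop the names listed in `exclude` from `country`, keeping the original order."""
    def split(text):
        # Split on commas, ignoring whitespace around them and empty pieces
        return [part for part in re.split(r'\s*,\s*', text.strip()) if part]

    excluded = set(split(exclude))  # built once per row
    return ', '.join(name for name in split(country) if name not in excluded)

# Plain lists holding the same values as the 'Country' and 'Exclude' columns
countries = ['Germany, France, Brazil, India, Russia',
             'Russia, France, Jamaica, India, China',
             'Germany, Russia, Jamaica',
             'Italy, Jamaica']
excludes = ['France, Brazil', 'India, Russia', 'Jamaica', 'Italy']

out = [remove_excluded(a, b) for a, b in zip(countries, excludes)]
print(out)
```

With the DataFrame from the question, the same comprehension over `zip(df['Country'], df['Exclude'])` can be assigned to `df['Out']`.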