Remove words that are not common to a column in two pandas DataFrames

Time: 2019-05-07 05:27:02

Tags: python pandas dataframe

I have two DataFrames: df1 and df2.

df1 looks like this:

id   text
1    I love this car
2    I hate this car
3    Cars are life
4    Bikers are also good

df2 looks like this:

id   text
1    I love this supercar
2    I hate cars
3    Cars are love
4    Bikers are nice

Now I want to keep only the words that appear in both df1 and df2.

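The word car is in df1 but not in df2, so I want to remove it.

The word life is in df1 but not in df2, so I want to remove it.

The word also is in df1 but not in df2, so I want to remove it.

The word good is in df1 but not in df2, so I want to remove it.

The word supercar is in df2 but not in df1, so I want to remove it.

The word nice is in df2 but not in df1, so I want to remove it.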

Expected output for df1:

id   text
1    I love this
2    I hate this
3    Cars are
4    Bikers are

Expected output for df2:

id   text
1    I love this
2    I hate cars
3    Cars are love
4    Bikers are

1 answer:

Answer 0 (score: 2)

Build the intersection of the words in the two columns, then drop every word that is not in it:

# Collect the unique words in each text column
a = {y for x in df1['text'] for y in x.split()}
b = {y for x in df2['text'] for y in x.split()}
# Keep only the words that occur in both columns
c = a & b
print(c)
{'hate', 'are', 'Bikers', 'this', 'love', 'I', 'Cars'}

# Rebuild each string from the words that are in the intersection
df1['text'] = df1['text'].apply(lambda x: ' '.join(y for y in x.split() if y in c))
df2['text'] = df2['text'].apply(lambda x: ' '.join(y for y in x.split() if y in c))
print(df1)
   id         text
0   1  I love this
1   2  I hate this
2   3     Cars are
3   4   Bikers are

print(df2)
   id           text
0   1    I love this
1   2         I hate
2   3  Cars are love
3   4     Bikers are
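
Note that the comparison above is case-sensitive, which is why cars is dropped from row 2 of df2 even though the question's expected output keeps it. If case-insensitive matching is what is wanted (an assumption, since the question does not say so explicitly), a minimal sketch could lowercase the words only for the comparison and leave the original casing in the result. The helper names vocabulary and keep_common are made up for this illustration:

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["I love this car", "I hate this car",
             "Cars are life", "Bikers are also good"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["I love this supercar", "I hate cars",
             "Cars are love", "Bikers are nice"],
})


def vocabulary(series):
    """Return the set of lowercased, whitespace-separated words in a text column."""
    return {word.lower() for text in series for word in text.split()}


# Words shared by both columns, compared without regard to case
common = vocabulary(df1["text"]) & vocabulary(df2["text"])


def keep_common(text):
    """Drop words whose lowercase form is not shared; original casing is kept."""
    return " ".join(word for word in text.split() if word.lower() in common)


df1["text"] = df1["text"].apply(keep_common)
df2["text"] = df2["text"].apply(keep_common)

print(df1["text"].tolist())
print(df2["text"].tolist())
```

With the sample data this keeps cars in row 2 of df2, because Cars appears in df1. Splitting on whitespace means punctuation attached to a word would still count as part of that word.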