I have two string columns in a DataFrame, and I want to remove from A the words that also appear in B.
A -> Stack Overlflow is great
B -> stack great
A-B -> overflow is
I tried the following code. However, it only works when column B contains a single word.
df['A-B'] = [' '.join(set(a.split())-set(b.split())) for a, b in zip(df['A'], df['B'])]
What can I change so that it also works when B contains multiple words?
Answer 0 (score: 2)
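You can use np.setdiff1d(). Note that it returns the unique words in sorted order, so the original word order is not preserved: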
df['A-B']=df.apply(lambda x: ' '.join(np.setdiff1d(x['A'].lower().split(),
x['B'].lower().split())),axis=1)
print(df)
A B A-B
0 Stack Overlflow is great stack great is overlflow
Your solution is almost there; just add series.str.lower() when zipping the columns:
df['A-B']=[' '.join(set(a.split())-set(b.split()))
for a, b in zip(df['A'].str.lower(), df['B'].str.lower())]
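To see why lowercasing matters, here is a minimal sketch on plain strings (no DataFrame involved; the variable names are only for illustration). Set difference compares strings exactly, so "Stack" and "stack" count as different words unless both sides are lowercased first:

```python
a = "Stack Overlflow is great"
b = "stack great"

# Without lowercasing, "Stack" and "stack" are different strings,
# so only "great" is removed.
case_sensitive = set(a.split()) - set(b.split())

# After lowercasing both sides, "stack" is removed as well.
case_insensitive = set(a.lower().split()) - set(b.lower().split())

# sorted() is used only to get a stable print order; sets are unordered.
print(sorted(case_sensitive))    # ['Overlflow', 'Stack', 'is']
print(sorted(case_insensitive))  # ['is', 'overlflow']
```

Because sets are unordered, the joined result of this approach may come out in any word order.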
If the strings contain repeated words, use OrderedDict, which removes duplicates like set() does but also preserves order:
df = pd.DataFrame({'A': ['Stack Overlflow is great is great'], 'B': ['stack great']})
A B
0 Stack Overlflow is great is great stack great
from collections import OrderedDict
df['A-B']=[' '.join([ele for ele in OrderedDict.fromkeys(a) if ele not in b ])
for a,b in zip(df.A.str.lower().str.split(),df.B.str.lower().str.split())]
print(df)
A B A-B
0 Stack Overlflow is great is great stack great overlflow is
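If you would rather keep the original casing of the words in A, one possible variant is sketched below. The helper name subtract_words is hypothetical and not part of pandas; it compares words case-insensitively, keeps first-occurrence order, and drops repeats. Plain lists stand in for the two columns here, on the assumption that zip(df['A'], df['B']) would be used the same way:

```python
def subtract_words(a, b):
    """Return the words of `a` that do not appear in `b`, ignoring case.

    Hypothetical helper for illustration: keeps first-occurrence order
    and the original casing of `a`, and drops repeated words.
    """
    exclude = {w.lower() for w in b.split()}
    seen = set()
    kept = []
    for word in a.split():
        key = word.lower()
        if key not in exclude and key not in seen:
            seen.add(key)
            kept.append(word)
    return " ".join(kept)


# Plain lists stand in for df['A'] and df['B'].
col_a = ["Stack Overlflow is great is great"]
col_b = ["stack great"]
result = [subtract_words(a, b) for a, b in zip(col_a, col_b)]
print(result)  # ['Overlflow is']
```

Using a set for the excluded words also makes each membership check constant time, which may help when B holds many words.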
Answer 1 (score: 2)
The example df is:
>>> df = pd.DataFrame({'A': ['Stack Overlflow is great'], 'B': ['stack great']})
You can use apply:
>>> df['A-B'] = df.apply(lambda x: ' '.join([i for i in x['A'].split() if i.lower() not in x['B'].lower().split()]), axis=1)
>>> df
A B A-B
0 Stack Overlflow is great stack great Overlflow is
Answer 2 (score: 0)
Try this one-liner:
' '.join(list(set(list(df.A.str.lower().str.split(' '))[0])-set(list(df.B.str.lower().str.split(' '))[0])))
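Just convert both column values to lowercase and split them on spaces, turn the results into lists, then take the set difference of those lists and join the remaining words with spaces. Note that the [0] index means this only processes the first row.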