I have a DataFrame with 2 columns, one of which contains strings of words, for example:
Col1 Col2
0 1 how to remove this word
1 5 how to remove the word
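I want to remove every word that appears only once across the whole DataFrame (threshold = 1), so that I get, for example (even better if I can specify the threshold):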
Col1 Col2
0 1 how to remove word
1 5 how to remove word
Any suggestions? Thanks!
Answer 0 (score: 7)
Let's try using Counter:
from collections import Counter
from itertools import chain
# split each string into a list of words
# (equivalent to [s.split() for s in df['Col2'].tolist()])
v = df['Col2'].str.split().tolist()
# compute word frequency across the whole column
c = Counter(chain.from_iterable(v))
# keep only words that occur more than once (threshold = 1), join, and re-assign
df['Col2'] = [' '.join([j for j in i if c[j] > 1]) for i in v]
df
Col1 Col2
0 1 how to remove word
1 5 how to remove word
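The question also asks for a configurable threshold. Below is a minimal sketch of the same Counter approach wrapped in a function. The helper name drop_rare_words and its threshold parameter are hypothetical (not part of pandas), and it assumes words are separated by whitespace and that missing values should be treated as empty strings.

```python
from collections import Counter
from itertools import chain

import pandas as pd


def drop_rare_words(series, threshold=1):
    """Hypothetical helper: drop words whose count across the whole
    Series is less than or equal to `threshold`."""
    # Assumption: treat missing values as empty strings, split on whitespace.
    tokens = series.fillna('').astype(str).str.split().tolist()
    # Global word frequency over every row.
    counts = Counter(chain.from_iterable(tokens))
    # Keep only words that occur more than `threshold` times.
    cleaned = [' '.join(w for w in row if counts[w] > threshold) for row in tokens]
    return pd.Series(cleaned, index=series.index)


df = pd.DataFrame({'Col1': [1, 5],
                   'Col2': ['how to remove this word', 'how to remove the word']})
df['Col2'] = drop_rare_words(df['Col2'], threshold=1)
```

With threshold=1 this should give the same result as the answer above. Raising the threshold removes more words, and a row can end up as an empty string if none of its words are frequent enough.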