I have a list c with 353000 elements. Each element is a parsed string. A sample of this list:
print c[25:50]
['aluminum co of america', 'aluminum co of america', 'aluminum co of america', 'aluminum company of america', 'aluminum company of america', 'aluminum co of america', 'aluminum company of america', 'aluminum company of america', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'ace cash express, inc.', 'ace cash express, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.']
I computed the frequency of the words in the list:
from collections import Counter
r=[]
for e in c:
    r.extend(e.split())
count=Counter(r)
So the six most frequently used words in the list are:
{'inc.': 18670, 'corporation': 9255, 'company': 2632, 'group,': 1190, '&': 1158, 'financial': 1025}
I want to remove these words from every element of the list. For example, if I have "aluminum corporation of america", the output should be "aluminum of america". Any help?
Answer 0 (score: 1)
# Using Generator Expression with `Counter` to speed it up a little bit
from collections import Counter
count = Counter(item for e in c for item in e.split())
# Get most frequently used words
words = {item for item, cnt in count.most_common(6)}
# filter the `words` in `c` and reconstruct the sentences in `c`
[" ".join([item for item in e.split() if item not in words]) for e in c]
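As a quick sanity check, the same approach can be wrapped in a function and run on a handful of names. This is only a sketch: the function name remove_common_words and the sample list are hypothetical, and the sample is far too small to reproduce the counts from the question.

```python
from collections import Counter

def remove_common_words(names, n=6):
    """Drop the n most frequent whitespace-separated tokens from every name."""
    count = Counter(token for name in names for token in name.split())
    common = {token for token, _ in count.most_common(n)}
    return [" ".join(t for t in name.split() if t not in common) for name in names]

# Hypothetical sample, not the real 353000-element list
sample = [
    "aluminum corporation of america",
    "acme corporation",
    "widget corporation",
    "ace cash express, inc.",
    "airtran holdings, inc.",
]

# With n=2 the removed tokens are 'corporation' (3 occurrences) and 'inc.' (2)
cleaned = remove_common_words(sample, n=2)
print(cleaned)
```

Tokens are compared exactly as split() produces them, so punctuation stays attached: 'express,' keeps its comma, and a token such as 'group,' is only removed when the comma is present.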
Answer 1 (score: 1)
You can use a regular expression to replace the words you want to remove with an empty string:
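import re
# Build the pattern from the six most common words only; iterating over count itself would include every word
# Note: the words are not escaped here, so '.' still matches any character
p = re.compile(' |'.join(word for word, _ in count.most_common(6)))
cleaned = [p.sub('', item) for item in c]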
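Edit: note, though, that you have to escape the . and & in your regex (strictly, only . is a regex metacharacter; escaping & is harmless), so it gets a bit more complicated than the above...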
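One way the escaping could be handled is with re.escape, combined with lookarounds so that only whole whitespace-delimited tokens are removed. This is a sketch rather than the answerer's own code: the words list is copied from the question, and the short list c is a hypothetical stand-in for the real data.

```python
import re

# The six words from the question; c is a hypothetical stand-in for the real list
words = ['inc.', 'corporation', 'company', 'group,', '&', 'financial']
c = [
    'aluminum corporation of america',
    'ace cash express, inc.',
    'incorporated metals',
    'smith & sons financial group, inc.',
]

# re.escape makes '.' match a literal dot instead of any character.
# (?<!\S) and (?!\S) require whitespace or a string boundary on both sides,
# so 'inc.' is not matched inside a longer token.
alternation = '|'.join(re.escape(w) for w in words)
p = re.compile(r'(?<!\S)(?:' + alternation + r')(?!\S)')

# Remove the words, then collapse the leftover runs of spaces
cleaned = [' '.join(p.sub('', item).split()) for item in c]
print(cleaned)
```

Splitting and re-joining after the substitution avoids having to match the surrounding spaces in the pattern itself, which is where the original ' |' separator falls short for the last word in the alternation.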