Question

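I am trying to remove common words from a text. Take, for example, the sentence

"It is not a commonplace river, but on the contrary is in all ways remarkable."

I want to reduce it to only its distinctive words, which means removing "it", "but", "a", and so on. I have one text file containing all the common words and another text file containing the paragraphs. How can I remove the common words from the paragraph file?

For example: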

['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']

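How can I efficiently remove the common words from the file? I have a text file named common.txt that lists all the common words. How do I use that list to remove those words from the sentence above? The final output I want is: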

['commonplace', 'river', 'contrary', 'remarkable']

Does this make sense?

Thanks.

Answer 1

Here is an example you can use:

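# text is assumed to hold the paragraph as a string
# Strip commas and periods, then split the text into words
l = text.replace(",", "").replace(".", "").split(" ")

# Count how many times each word occurs
occurs = {}
for word in l:
    occurs[word] = l.count(word)

# Keep only the words that occur fewer than 3 times
resultx = ''
for word in occurs.keys():
    if occurs[word] < 3:
        resultx += word + " "

# Drop the trailing space
resultx = resultx[:-1]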

You can change the 3 to suit your needs, or base the threshold on the average count instead:

sum(occurs.values()) / len(occurs)
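The frequency-based idea above could also be sketched with `collections.Counter`, which counts every word in a single pass instead of calling `count()` once per word. This is only an illustration: the function name `rare_words` and the sample text are made up, and the threshold falls back to the mean count when none is given.

```python
from collections import Counter
import string


def rare_words(text, threshold=None):
    """Return words whose count is below threshold (default: the mean count)."""
    # Remove ASCII punctuation, then split on whitespace.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    counts = Counter(cleaned.split())
    if not counts:
        return []
    if threshold is None:
        # Mean number of occurrences per distinct word.
        threshold = sum(counts.values()) / len(counts)
    # Counter keeps first-seen order (Python 3.7+).
    return [word for word, n in counts.items() if n < threshold]


sample = "the cat and the dog and the bird"
print(rare_words(sample))               # ['cat', 'dog', 'bird']
print(rare_words(sample, threshold=3))  # ['cat', 'and', 'dog', 'bird']
```

Note that this approach decides what is "common" from the text itself, so it does not use the common.txt list from the question.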

Other

If the comparison should be case-insensitive, change the first line to:

l = text.replace(",","").replace(".","").lower().split(" ")

Answer 2

You want to use the "set" object in Python.

If order and number of occurrences don't matter:

str_list = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']

common_words = ['It', 'is', 'not', 'a', 'but', 'on', 'the', 'in', 'all', 'ways','other_words']

set(str_list) - set(common_words)

>>> {'contrary', 'commonplace', 'river', 'remarkable'}

If both matter:

# Using a set makes the membership test much faster
common_set = set(common_words)

[s for s in str_list if s not in common_set]

>>> ['commonplace', 'river', 'contrary', 'remarkable']
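To connect this with the files from the question, here is one possible sketch. It assumes common.txt holds whitespace-separated words and that matching should ignore case and surrounding punctuation; neither is stated in the question. The helper names `load_common_words` and `remove_common_words` are hypothetical.

```python
import string


def load_common_words(path):
    """Read whitespace-separated words from a file into a lowercase set."""
    with open(path, encoding="utf-8") as f:
        return {word.lower() for word in f.read().split()}


def remove_common_words(text, common):
    """Return the words of text, in order, that are not in the common set."""
    kept = []
    for token in text.split():
        # Drop punctuation attached to the start or end of the token.
        word = token.strip(string.punctuation)
        if word and word.lower() not in common:
            kept.append(word)
    return kept


sentence = ("It is not a commonplace river, but on the contrary "
            "is in all ways remarkable.")
# In practice this set would come from load_common_words("common.txt").
common = {"it", "is", "not", "a", "but", "on", "the", "in", "all", "ways"}
print(remove_common_words(sentence, common))
# ['commonplace', 'river', 'contrary', 'remarkable']
```

For the paragraph file, the same function can be applied to the result of reading that file, for example line by line if the file is large.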

Answer 3

The simplest way is to read() your common.txt and then use a list comprehension that keeps only the words that are not in the contents of the file we read:

with open('common.txt') as f:
    # Split into a set of words so that "in" matches whole words, not substrings
    content = set(f.read().split())

s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']

res = [i for i in s if i not in content]
print(res)
# ['commonplace', 'river', 'contrary', 'remarkable']

filter also works here:

res = list(filter(lambda x: x not in content, s))
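One thing to watch for with either form: if `content` is the raw string returned by `f.read()`, then `not in` is a substring test, whereas against a set of words it is an exact-word test. The sketch below uses made-up data (the `raw` string stands in for the text of common.txt) to show how the two can differ.

```python
raw = "it is not a but on the contrary"  # stands in for f.read()
words = ["on", "contra", "river", "he"]

# Substring test against the raw string: 'contra' and 'he' are dropped
# because they occur inside 'contrary' and 'the'.
by_substring = [w for w in words if w not in raw]

# Exact-word test against a set built by splitting on whitespace.
common = set(raw.split())
by_word = [w for w in words if w not in common]

print(by_substring)  # ['river']
print(by_word)       # ['contra', 'river', 'he']
```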
