Question

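I need to:

1) Clean a .txt file using a list of stop words, which I have in a separate .txt file.

2) After that, count the 25 most frequent words.

This is what I came up with for the first part: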

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-

import re
from collections import Counter

f=open("text_to_be_cleaned.txt")
txt=f.read()
with open("stopwords.txt") as f:
    stopwords = f.readlines()
stopwords = [x.strip() for x in stopwords]

querywords = txt.split()
resultwords  = [word for word in querywords if word.lower() not in stopwords]
cleantxt = ' '.join(resultwords)

For the second part, I am using this code:

words = re.findall(r'\w+', cleantxt)
lower_words = [word.lower() for word in words]
word_counts = Counter(lower_words).most_common(25)
top25 = word_counts[:25]

print top25

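The source file to be cleaned looks like this:

(b) in the second paragraph, first sentence, the words "and the High Representative" shall be inserted at the end; in the second sentence, the words "It shall hold an annual debate" shall be replaced by "Twice a year it shall hold a debate" and the words ", including the common security and defence policy" shall be inserted at the end.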

The stop-word list looks like this: this thises they thee the then thence thenest thener them

When I run all of this, the output still contains words from the stop-word list:
[('article', 911), ('european', 586), ('the', 586), ('council', 569), ('union', 530), ('member', 377), ('states', 282), ('parliament', 244), ('commission', 230), ('accordance', 217), ('treaty', 187), ('in', 174), ('procedure', 161), ('policy', 137), ('cooperation', 136), ('legislative', 136), ('acting', 130), ('act', 125), ('amended', 125), ('state', 123), ('provisions', 115), ('security', 113), ('measures', 111), ('adopt', 109), ('common', 108)]

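As you can probably tell, I have only just started learning Python, so I would be very grateful for simple explanations! :)

The files used can be found here: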

Stopwordlist

File to be cleaned

Edit: added examples of the source file, the stop-word file, and the output. Provided the source files.

Answer 1

This is a bit of a wild guess, but I think the problem is here:

querywords = txt.split()

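You only split the text on whitespace, which means that some stop words may still have punctuation attached to them and are therefore not filtered out in the next step.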

>>> text = "Text containing stop words like a, the, and similar"
>>> stopwords = ["a", "the", "and"]
>>> querywords = text.split()
>>> cleantxt = ' '.join(w for w in querywords if w not in stopwords)
>>> cleantxt
'Text containing stop words like a, the, similar'

Instead, you can use re.findall, as you already do later in your code:

>>> querywords = re.findall(r"\w+", text)
>>> cleantxt = ' '.join(w for w in querywords if w not in stopwords)
>>> cleantxt
'Text containing stop words like similar'

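Note, however, that this splits compound words such as "re-arranged" into "re" and "arranged". If that is not what you want, you can instead split on whitespace and then strip (some) punctuation characters from each word (although there may be more of them in the text):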

querywords = [w.strip(" ,.-!?") for w in txt.split()]

Changing just that one line seems to fix the problem for the input file you provided.
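As a rough sketch of the split-then-strip variant, the snippet below uses string.punctuation from the standard library instead of a hand-picked character set. The sample sentence and stop words are made up for illustration, and Python 3 is assumed.

```python
import string

# Hypothetical sample data, not taken from the asker's files.
text = "The re-arranged text, the (old) one, and a new one!"
stopwords = {"the", "and", "a"}

# Split on whitespace, then strip ASCII punctuation from both ends of each
# token; hyphens inside a word such as "re-arranged" are left untouched.
querywords = [w.strip(string.punctuation) for w in text.split()]

# Drop empty tokens and stop words (compared case-insensitively).
resultwords = [w for w in querywords if w and w.lower() not in stopwords]
print(resultwords)
```

Keep in mind that string.punctuation only covers ASCII characters, so typographic quotes or dashes in the real text would still need to be added to the strip set by hand.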

The rest looks fine, though there are a few minor points:

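You should convert stopwords to a set, so that lookup is O(1) instead of O(n).
Make sure to lower the stop words, if they are not lower-case already.
There is no need to ' '.join the cleaned words if you intend to split the text again afterwards.
top25 = word_counts[:25] is redundant; the list already has at most 25 elements.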
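Taken together, those points could look roughly like the sketch below. It is only an illustration in Python 3: the txt string and stopword_lines list are hypothetical stand-ins for the contents of the two files.

```python
from collections import Counter

# Hypothetical stand-ins for the contents of the two files.
txt = "The Council and the European Council; the Union, the council."
stopword_lines = ["The\n", "and\n", "a\n"]

# A set gives O(1) membership tests; lower-case the entries once up front.
stopwords = {line.strip().lower() for line in stopword_lines}

# Keep the cleaned words as a list instead of joining and re-splitting.
words = [w.strip(" ,.;-!?").lower() for w in txt.split()]
resultwords = [w for w in words if w and w not in stopwords]

# most_common(25) already returns at most 25 (word, count) pairs.
top25 = Counter(resultwords).most_common(25)
print(top25)
```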

Answer 2

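Your code is almost there. The main mistake is that you only run the regex \w+ to pick out words after you have already filtered the words produced by str.split. That does not work, because punctuation is still attached to the str.split results, so those words never match the stop words. Try the following code instead.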

import re
from collections import Counter

with open('treaty_of_lisbon.txt', encoding='utf8') as f:
    target_text = f.read()

with open('terrier-stopwords.txt', encoding='utf8') as f:
    stop_word_lines = f.readlines()

target_words = re.findall(r'[\w-]+', target_text.lower())
stop_words = set(map(str.strip, stop_word_lines))

interesting_words = [w for w in target_words if w not in stop_words]
interesting_word_counts = Counter(interesting_words)

print(interesting_word_counts.most_common(25))
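To illustrate what the character class [\w-]+ in the code above changes compared with plain \w+, here is a small comparison on a made-up sentence (the sample string is an assumption, not part of the treaty file):

```python
import re

# Hypothetical sample sentence containing hyphenated words.
sample = "The re-arranged, co-operative text."

# \w+ breaks hyphenated words apart at the hyphen.
plain = re.findall(r"\w+", sample.lower())

# [\w-]+ treats the hyphen as part of a word and keeps them whole.
hyphenated = re.findall(r"[\w-]+", sample.lower())

print(plain)
print(hyphenated)
```

One trade-off is that [\w-]+ also matches a hyphen standing on its own, so a dash surrounded by spaces would be counted as a "word" unless it is filtered out.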

Clean .txt and count most frequent words
