Question

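I have the code below, and I'm trying to apply a stopword list to a list of words. However, the results still show words such as "a" and "the", which I thought this process would remove. Any ideas about what is going wrong would be great.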

import nltk
from nltk.corpus import stopwords

word_list = open("xxx.y.txt", "r")
filtered_words = [w for w in word_list if not w in stopwords.words('english')]
print filtered_words

Answer 1

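There are a few things to note.

If you are going to check membership against a list over and over, I would use a set instead of a list.
stopwords.words('english') returns a list of lowercase stopwords. Your source very likely contains capital letters, so those words don't match.
You aren't reading the file correctly: you are checking the items of the file object (which yields whole lines) rather than a list of words split on whitespace.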
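
To make the last two points concrete, here is a minimal sketch that does not need NLTK at all. The `stops` set is a hypothetical stand-in for `set(stopwords.words('english'))`, and `io.StringIO` stands in for the opened text file; both are assumptions for illustration only.

```python
import io

# Hypothetical stand-in for set(stopwords.words('english')); the real list is much longer.
stops = {"a", "the", "is", "on"}

# io.StringIO behaves like an open text file when iterated.
fake_file = io.StringIO("The cat is on a mat\n")

# Iterating over a file object yields whole lines, not words,
# so no item can ever equal a single stopword and nothing is removed.
lines = [w for w in fake_file if w not in stops]

# Splitting the line gives words, but the comparison is still case-sensitive.
words = "The cat is on a mat\n".split()
case_sensitive = [w for w in words if w not in stops]

# Lowercasing each word before the lookup also catches "The".
case_insensitive = [w for w in words if w.lower() not in stops]

print(lines)
print(case_sensitive)
print(case_insensitive)
```

The first list still holds the entire line, the second still holds "The", and only the third drops every stopword.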

Putting it all together:

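import nltk
from nltk.corpus import stopwords

# Build the set once so each membership test is a fast hash lookup.
stops = set(stopwords.words('english'))

# The with-block closes the file when the loop finishes.
with open("xxx.y.txt", "r") as word_list:
    for line in word_list:
        for w in line.split():
            # The stopword list is lowercase, so compare in lowercase.
            if w.lower() not in stops:
                print(w)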
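
If you want a filtered list back, as the `filtered_words` line in the question was trying to build, rather than printing each word, something along these lines might work. This is only a sketch: `filter_stopwords` and `demo_stops` are hypothetical names, and the small hand-written set is there so the example runs without the NLTK corpus. With NLTK installed you would pass `set(stopwords.words('english'))` instead.

```python
def filter_stopwords(lines, stops):
    """Return the words from an iterable of lines that are not in stops."""
    return [
        w
        for line in lines
        for w in line.split()
        if w.lower() not in stops  # lowercase lookup, original casing kept
    ]


# Hypothetical small stopword set, for illustration only.
demo_stops = {"a", "the", "of"}

filtered_words = filter_stopwords(["The end of a story", "the start"], demo_stops)
print(filtered_words)  # ['end', 'story', 'start']
```

Because an open file is itself an iterable of lines, the file object can be passed directly as `lines`. Note that `split()` only splits on whitespace, so a token with attached punctuation such as "the," would not match a stopword.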
