Question

Background:

1) I have the following code, which uses the nltk package to remove stopwords:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = word_tokenize(your_string)
lower_tokens = [t.lower() for t in tokens]
filtered_words = [word for word in lower_tokens if word not in stopwords.words('english')]

2) This code removes stopwords such as "the", producing:

['dog', 'bark', 'tree', 'sees', 'squirrel']

3) I changed the stopwords with the following code so that the word "not" is kept:

to_remove = ['not']
new_stopwords = set(stopwords.words('english')).difference(to_remove)

Problem:

4) However, when I use new_stopwords with the following code:

your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = word_tokenize(your_string)
lower_tokens = [t.lower() for t in tokens]
filtered_words = [word for word in lower_tokens if word not in new_stopwords.words('english')]

5) I get the following error, because new_stopwords is a set:

AttributeError: 'set' object has no attribute 'words'

Question:

6) How can I use the newly defined new_stopwords to get the desired output:

['dog', 'not', 'bark', 'tree', 'sees', 'squirrel']

Answer 1

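You are very close, but you are misreading the error message: the problem is not, as you put it, that "new_stopwords is a set", but that "'set' object has no attribute 'words'".

And it doesn't. new_stopwords is a set, which means you can use it directly in the list comprehension:

filtered_words = [word for word in lower_tokens if word not in new_stopwords]

You could also spare yourself the trouble of modifying the stopword list and simply use two conditions:

filtered_words = [word for word in lower_tokens if word not in stopwords.words('english') or word == 'not']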
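
To see both variants side by side without needing the NLTK corpus data, here is a minimal sketch. It rests on two assumptions that are not part of the question: english_stopwords is a small hand-picked stand-in for stopwords.words('english'), and str.split() replaces word_tokenize, which happens to be enough for this particular sentence.

```python
# Hypothetical stand-in for stopwords.words('english'): a small hand-picked
# subset, used only so this sketch runs without the NLTK corpus data.
english_stopwords = ["the", "does", "not", "at", "when", "it", "a"]

# Same construction as in the question: a plain set with 'not' taken out.
to_remove = ["not"]
new_stopwords = set(english_stopwords).difference(to_remove)

your_string = "The dog does not bark at the tree when it sees a squirrel"
# str.split() stands in for word_tokenize (assumption; fine for this sentence).
lower_tokens = [t.lower() for t in your_string.split()]

# A set has no .words() method, which is exactly what the AttributeError says.
assert not hasattr(new_stopwords, "words")

# Variant 1: test membership against the set directly.
filtered_words = [word for word in lower_tokens if word not in new_stopwords]

# Variant 2: keep the original list and add a second condition for 'not'.
filtered_two_conditions = [
    word for word in lower_tokens
    if word not in english_stopwords or word == "not"
]

print(filtered_words)
print(filtered_two_conditions)
```

With this stand-in list, both variants should keep "not" and drop the remaining stopwords. In the real code, the first variant also avoids calling stopwords.words('english') once per token inside the comprehension.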

