Removing stopwords at the start of a sentence with NLTK

Asked: 2018-10-18 19:30:08

Tags: python python-3.x nltk

I am trying to remove all stopwords from a text input. The code below removes all the stopwords, except the one that begins a sentence.

How do I remove those words?

from nltk.stem.wordnet import WordNetLemmatizer

from nltk.corpus import stopwords
stopwords_nltk_en = set(stopwords.words('english'))

from string import punctuation
exclude_punctuation = set(punctuation)

stoplist_combined = set.union(stopwords_nltk_en, exclude_punctuation)

def normalized_text(text):
    lemma = WordNetLemmatizer()
    stopwords_punctuations_free = ' '.join([i for i in text.lower().split() if i not in stoplist_combined])
    normalized = ' '.join(lemma.lemmatize(word) for word in stopwords_punctuations_free.split())
    return normalized


sentence = [['The birds are always in their house.'], ['In the hills the birds nest.']]

for item in sentence:
    print(normalized_text(str(item)))

OUTPUT: 
   the bird always house 
   in hill bird nest

1 Answer:

Answer 0 (score: 1)

The culprit is this line of code:

print (normalized_text(str(item)))

If you call str(item) on the first element of the sentence list, you get:

['The birds are always in their house.']

Lowering and splitting it then gives:

["['the", 'birds', 'are', 'always', 'in', 'their', "house.']"]

As you can see, the first element is ['the, which does not match the stopword the.
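A quick standalone check (not part of the original post) of what str() does to a one-element list — the brackets and quotes of the list repr end up glued to the edge tokens:

```python
# str() on a list renders the repr, brackets and quotes included
item = ['The birds are always in their house.']
s = str(item)
print(s)  # ['The birds are always in their house.']

# lowercasing and splitting then leaves bracket characters on the edge tokens
tokens = s.lower().split()
print(tokens[0])   # ['the
print(tokens[-1])  # house.']
```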

Solution: convert item to a str with ''.join(item)
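A minimal sketch of that fix, reusing the post's first sentence: joining the inner list yields a plain string, so the leading stopword now tokenizes cleanly:

```python
item = ['The birds are always in their house.']
text = ''.join(item)            # plain string, no brackets or quotes
print(text.lower().split()[0])  # the  -- now matches the stopword list
```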


Edit after comments

There are still some apostrophes ' left in the text strings. To fix this, call normalized_text as:

for item in sentence:
    print (normalized_text(item))

Then import the regular-expression module with import re and change:

text.lower().split()

to:

re.split('\'| ', ''.join(text).lower())
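A standalone check of what that change does (the sample string below is made up for illustration, not from the post): re.split with the pattern '\'| ' breaks the text at apostrophes as well as spaces, so no token keeps a stray ' attached:

```python
import re

# hypothetical input resembling the post's sentences, with an apostrophe added
text = ["The bird's nest is in the hills."]

# split on apostrophes as well as spaces, after joining the list into a str
tokens = re.split('\'| ', ''.join(text).lower())
print(tokens)  # ['the', 'bird', 's', 'nest', 'is', 'in', 'the', 'hills.']
```

Note that sentence-final punctuation is still attached to the last token ('hills.'); the stoplist only removes tokens that are exactly a single punctuation character.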