Question

I'm struggling with NLTK stop words.

Here's some of my code. Can anyone tell me what's wrong?

from nltk.corpus import stopwords

def removeStopwords( palabras ):
     return [ word for word in palabras if word not in stopwords.words('spanish') ]

palabras = ''' my text is here '''

Answer 1

Your problem is that iterating over a string yields each character, not each word.

For example:

>>> palabras = "Buenos dias"
>>> [c for c in palabras]
['B', 'u', 'e', 'n', 'o', 's', ' ', 'd', 'i', 'a', 's']

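You need to iterate over and check each word. Fortunately, a split function already exists in the Python standard library (originally in the string module, and available as the str.split method). However, since you are dealing with natural language that includes punctuation, you should look here for a more robust answer that uses the re module.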
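
To make the difference concrete, here is a small sketch comparing str.split with re.findall. The sample sentence is only an illustration, not text from the question.

```python
import re

texto = "Buenos dias, amigo. Como estas?"

# str.split only breaks on whitespace, so punctuation stays attached to the words
split_words = texto.split()

# \w+ matches runs of letters, digits and underscores, which drops the punctuation
regex_words = re.findall(r"\w+", texto)

print(split_words)
print(regex_words)
```

With split, tokens such as 'dias,' keep their trailing comma and would never match an entry in a stop word list, which is why the regular expression is the safer choice here.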

Once you have a list of words, convert them all to lowercase before comparing, then compare them the way you have already shown.

Buena suerte.

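Edit 1

OK, try this code; it should work for you. It shows two ways of doing it. They are essentially the same, but the first is clearer and the second is more pythonic.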

import re
from nltk.corpus import stopwords

sentence = 'El problema del matrimonio es que se acaba todas las noches despues de hacer el amor, y hay que volver a reconstruirlo todas las mananas antes del desayuno.'

# We only want to work with lowercase for the comparisons
sentence = sentence.lower()

# Remove punctuation and split into separate words
# (in Python 3, \w already matches Unicode letters for str patterns, so no flags are needed)
words = re.findall(r'\w+', sentence)

# This is the simple way to remove stop words
important_words = []
for word in words:
    if word not in stopwords.words('spanish'):
        important_words.append(word)

print(important_words)

# This is the more pythonic way
# (filter returns an iterator in Python 3, so wrap it in list)
important_words = list(filter(lambda x: x not in stopwords.words('spanish'), words))

print(important_words)
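
One thing worth noting about both versions: stopwords.words('spanish') is called once per word, and it returns a list, so every membership test is a linear scan. A possible refinement is to build a set once and reuse it. The sketch below assumes a hypothetical helper name (remove_stopwords) and uses a short hand-written stand-in for the stop list so it runs without the NLTK corpus; in real use you would pass stopwords.words('spanish') instead.

```python
import re

def remove_stopwords(text, stop_list):
    """Return the lowercase words of text that are not in stop_list."""
    # Build the set once; set lookups are much cheaper than scanning a list
    stop_set = set(stop_list)
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in stop_set]

# Hypothetical stand-in for stopwords.words('spanish'), for illustration only
sample_stop_list = ["el", "del", "es", "que", "se", "las", "de", "y", "a"]

result = remove_stopwords("El problema del matrimonio es que se acaba", sample_stop_list)
print(result)
```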

I hope this helps.

Answer 2

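Tokenize the text first with a tokenizer, then compare the list of tokens against the stop list; that way you don't need the re module. I added an extra parameter so you can switch between languages.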

import nltk
from nltk.corpus import stopwords

def remove_stopwords(sentence, language):
    return [ token for token in nltk.word_tokenize(sentence) if token.lower() not in stopwords.words(language) ]
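
Keep in mind that a tokenizer like nltk.word_tokenize returns punctuation marks as separate tokens, so they survive a pure stop word filter. Here is a minimal sketch of one way to drop them as well. The function name and the sample data are hypothetical: the token list is written by hand to resemble tokenizer output, and the stop list is a tiny stand-in for stopwords.words(language).

```python
def drop_stopwords_and_punctuation(tokens, stop_words):
    """Keep tokens that contain a letter or digit and are not stop words."""
    stop_set = set(stop_words)
    return [
        token for token in tokens
        # a token made only of punctuation has no alphanumeric character
        if any(ch.isalnum() for ch in token) and token.lower() not in stop_set
    ]

# Hand-written tokens, similar to what a tokenizer would produce (assumption)
tokens = ["El", "amor", ",", "y", "el", "desayuno", "."]
# Tiny illustrative stop list, not the real NLTK one
sample_stop_words = ["el", "y"]

result = drop_stopwords_and_punctuation(tokens, sample_stop_words)
print(result)
```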

Dime si te fue de util ;)

Answer 3

Another option, using a more modern module (2020):

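from nltk.corpus import stopwords
from textblob import TextBlob

def removeStopwords(texto):
    # TextBlob(...).words tokenizes the text and leaves out punctuation
    words = TextBlob(texto).words
    # Compare in lowercase so capitalized stop words are removed too
    outputlist = [word for word in words if word.lower() not in stopwords.words('spanish')]
    return ' '.join(outputlist)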
