I am trying to apply a stopword list to a text, but first I need to remove a few words from that list. The problem is that when I apply it to the text, the program ends up in what looks like an infinite loop.
import re
from nltk.corpus import stopwords

# French stopwords, minus the negation words I want to keep
stop_words = set(stopwords.words('french'))
negation = ['n', 'pas', 'ne']
remove_words = [word for word in stop_words if word not in negation]
# build one alternation pattern matching any remaining stopword
stopwords_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, remove_words)))
replace_stopwords = stopwords_regex.sub('', text)  # 'text' holds the corpus to clean
print(replace_stopwords)
It is hard to give a reproducible example, because on a single phrase it works fine; but on a text containing many strings the stopwords do get removed, yet the program never finishes.
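As a quick diagnostic (not part of the original question): timing a single sub call on a slice of the text can show whether the program is actually looping or the big alternation regex is simply slow on a long input. The snippet below reuses the stopwords_regex and text variables from above.

import time
start = time.time()
# run the substitution on the first 10,000 characters only
sample_result = stopwords_regex.sub('', text[:10000])
print('elapsed seconds:', time.time() - start)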
Answer 0 (score: 2)
You can first tokenize the corpus with nltk.RegexpTokenizer and then remove the (modified) stopwords token by token:
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
content_french = ("John Richard Bond explique pas le rôle de l'astronomie.")
# initialize & apply tokenizer
tokenizer = RegexpTokenizer(r'''\w'|\w+|[^\w\s]''')
content_french_token = tokenizer.tokenize(content_french)
# initialize & modify stopwords
stop_words = set(stopwords.words('french'))
negation = {'n', 'pas', 'ne'}
stop_words = stop_words - negation  # keep the negation words in the text
# modify your text
content_french = " ".join([wrd for wrd in content_french_token if wrd not in stop_words])