I am trying to apply a stopword list to a text, but first I need to remove a few words from that list. The problem is that when I apply it to the text, the program ends up in what looks like an infinite loop.
import re
from nltk.corpus import stopwords

# French stopwords, minus the negation words I want to keep
stop_words = set(stopwords.words('french'))
negation = ['n', 'pas', 'ne']
remove_words = [word for word in stop_words if word not in negation]
# build one alternation pattern matching any remaining stopword
stopwords_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, remove_words)))
replace_stopwords = stopwords_regex.sub('', text)  # 'text' holds the corpus to clean
print(replace_stopwords)
It is hard to give a reproducible example, because on a single phrase it works fine; but on a text containing many strings the stopwords do get removed, yet the program never finishes.
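As a quick diagnostic (not part of the original question): timing a single sub call on a slice of the text can show whether the program is actually looping or the big alternation regex is simply slow on a long input. The snippet below reuses the stopwords_regex and text variables from above.

import time
start = time.time()
# run the substitution on the first 10,000 characters only
sample_result = stopwords_regex.sub('', text[:10000])
print('elapsed seconds:', time.time() - start)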
Answer 0 (score: 2)
You can first tokenize the corpus with nltk.RegexpTokenizer and then remove the (modified) stopwords token by token:
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
content_french = ("John Richard Bond explique pas le rôle de l'astronomie.")
# initialize & apply tokenizer
tokenizer = RegexpTokenizer(r'''\w'|\w+|[^\w\s]''')
content_french_token = tokenizer.tokenize(content_french)
# initialize & modify stopwords
stop_words = set(stopwords.words('french'))
negation = {'n', 'pas', 'ne'}
stop_words = stop_words - negation  # keep the negation words in the text
# modify your text
content_french = " ".join([wrd for wrd in content_french_token if wrd not in stop_words])