I have the following code. I need to add more words to the NLTK stopword list. After running this, it does not add the words to the list:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
new_words = open("stopwords_en.txt", "r")
new_stopwords = stop.union(new_word)
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in new_stopwords])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
doc_clean = [clean(doc).split() for doc in emails_body_text]
Answer 0 (score: 1)
Starting from the code @greg_data suggested: you also need to strip the newlines, and possibly do more than that — who knows what your stopwords file looks like? This might do it, for example:
new_words = open("stopwords_en.txt", "r").read().split()
new_stopwords = stop.union(new_words)
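As a side note, opening the file without closing it leaks the file handle; a `with` block does the same job and closes the file for you. A minimal self-contained sketch (the demo file and the small `stop` set are stand-ins for `stopwords_en.txt` and `stopwords.words('english')`):

```python
# Write a small demo stopwords file (stand-in for stopwords_en.txt).
with open("stopwords_demo.txt", "w") as f:
    f.write("foo\nbar baz\n")

stop = {"the", "a"}  # stand-in for stopwords.words('english')

# read().split() splits on any whitespace, so it handles both
# newline- and space-separated stopword files.
with open("stopwords_demo.txt") as f:
    new_words = f.read().split()

new_stopwords = stop.union(new_words)
print(sorted(new_stopwords))  # → ['a', 'bar', 'baz', 'foo', 'the']
```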
PS. Don't keep splitting and re-joining your document; tokenize it once and work with the list of tokens.
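The tokenize-once approach can be sketched like this — a minimal version of `clean` that lowercases, strips punctuation, and filters stopwords in a single pass over the token list (the small `new_stopwords` set is a stand-in for the real combined set, and the lemmatization step is omitted for brevity):

```python
import string

# Stand-in for the combined NLTK + file-based stopword set.
new_stopwords = {"the", "a", "is", "in", "and"}

# Translation table that deletes all ASCII punctuation.
punct_table = str.maketrans("", "", string.punctuation)

def clean(doc):
    # Tokenize once, then filter the token list —
    # no repeated split()/join() round-trips on the document.
    tokens = doc.lower().translate(punct_table).split()
    return [t for t in tokens if t not in new_stopwords]

print(clean("The cat is in the hat, and a dog!"))  # → ['cat', 'hat', 'dog']
```

Returning the token list directly also means the final step becomes `doc_clean = [clean(doc) for doc in emails_body_text]`, with no trailing `.split()`.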