Question

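I'm new to Python, and I'm trying to remove stop words from a file using NLTK. The code runs, but if my text is a tweet with a mention (@user), the tokenizer separates the punctuation and I get "@ user". Later I need to compute word frequencies, and I need mentions and hashtags to work correctly. My code: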

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop-word set once instead of once per line
stop_word = set(stopwords.words("portuguese"))

# "with" closes both files even if an error occurs
with open("newfile.txt", encoding="utf8") as arquivo, \
        open("stopwords.txt", "a", encoding="utf-8") as fp:
    for linha in arquivo:
        # word_tokenize splits "@user" into "@" and "user"
        word_tokens = word_tokenize(linha)
        filtered_sentence = [w for w in word_tokens if w not in stop_word]
        for words in filtered_sentence:
            fp.write(words + " ")
        fp.write("\n")

Edit: I'm not sure this is the best approach, but I solved it like this:

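import string  # needed for string.punctuation

# Replaces the write loop inside the per-line loop:
for words in filtered_sentence:
    fp.write(words)
    # No space after punctuation, so "@" and "user" are written as "@user"
    if words not in string.punctuation:
        fp.write(" ")
fp.write("\n")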

Answer 1

Instead of word_tokenize, you can use the Twitter-aware tokenizer that nltk provides, TweetTokenizer (in nltk.tokenize).
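As a rough sketch of how that swap might look, the snippet below tokenizes one tweet with TweetTokenizer and filters out stop words. The helper name remove_stop_words and the tiny SAMPLE_STOP_WORDS set are assumptions made for illustration. In the question's code the set would come from stopwords.words("portuguese") instead.

```python
from nltk.tokenize import TweetTokenizer

# Illustrative subset only (an assumption for this sketch); the question uses
# stopwords.words("portuguese"), which needs the NLTK "stopwords" corpus.
SAMPLE_STOP_WORDS = {"de", "a", "o", "que", "e", "com"}


def remove_stop_words(line, stop_words):
    """Tokenize a tweet and drop stop words (hypothetical helper)."""
    # TweetTokenizer keeps @mentions and #hashtags as single tokens
    tokens = TweetTokenizer().tokenize(line)
    return [t for t in tokens if t not in stop_words]


tokens = remove_stop_words("Falando com @user sobre #python", SAMPLE_STOP_WORDS)
print(" ".join(tokens))
```

Because the mention and the hashtag each stay a single token, writing the tokens out joined by spaces should no longer need the string.punctuation workaround from the question's edit.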
