Python - NLTK separates punctuation

Time: 2016-09-09 02:35:50

Tags: python nltk

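I'm new to Python, and I'm trying to use NLTK to remove stop words from a file. The code runs, but when the text is a tweet containing a mention (@user), the tokenizer splits off the punctuation and I get "@ user". Later I need to compute word frequencies, and for that the mentions and hashtags have to stay intact. My code: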

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop-word set once instead of on every line.
stop_word = set(stopwords.words("portuguese"))

# Open both files once; "with" closes them when the block ends.
with open('newfile.txt', encoding="utf8") as arquivo, \
        open("stopwords.txt", "a", encoding="utf-8") as fp:
    for linha in arquivo:
        # word_tokenize splits "@user" into "@" and "user".
        word_tokens = word_tokenize(linha)
        filtered_sentence = [w for w in word_tokens if w not in stop_word]
        for words in filtered_sentence:
            fp.write(words + " ")
        fp.write("\n")

Edit: I'm not sure this is the best approach, but I solved it like this:

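# Requires "import string" at the top of the script.
for words in filtered_sentence:
    fp.write(words)
    # Add a space only after non-punctuation tokens, so "@" and "user" are written as "@user".
    if words not in string.punctuation:
        fp.write(" ")
fp.write("\n")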

1 answer:

Answer 0: (score: 3)

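Instead of word_tokenize, you can use the Twitter-aware tokenizer that nltk provides, TweetTokenizer (from nltk.tokenize), which keeps @mentions and #hashtags as single tokens.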
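A minimal sketch of how that could look, combined with the word-frequency step the question mentions. The small hand-written stop-word set, the sample tweet, and the helper name filter_tweet are assumptions made here so the example runs without downloading the NLTK stopwords corpus; in the question's script you would keep stopwords.words("portuguese") and read lines from the file.

```python
from collections import Counter

from nltk.tokenize import TweetTokenizer

# Assumption: a tiny hand-written stop-word set for illustration only.
# The question's code uses set(stopwords.words("portuguese")) instead.
STOP_WORDS = {"o", "a", "de", "que", "e"}

# TweetTokenizer is rule-based and needs no extra NLTK data downloads.
tknzr = TweetTokenizer()


def filter_tweet(line):
    """Tokenize one tweet and drop stop words (hypothetical helper)."""
    return [w for w in tknzr.tokenize(line) if w not in STOP_WORDS]


# Hypothetical sample tweet, not taken from the question's data.
sample = "@user adorei o #python e o #nltk"
tokens = filter_tweet(sample)

# Word frequencies over the filtered tokens.
freq = Counter(tokens)

print(tokens)
print(freq.most_common(3))
```

With this tokenizer the mention and the hashtags come back as single tokens such as "@user" and "#python", so there is no need to glue punctuation back together when writing the output file.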