Python: stop-word-filtered output to a txt file loses the line structure

Date: 2016-11-03 08:39:24

Tags: python nltk

I am trying to remove stop words from a text file. The file consists of 9,000+ sentences, each on its own line.

The code seems to work fine, but I am obviously missing something, because the output file has lost the line structure of the original document, which I clearly want to keep.

Here is the code:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(inFile.read())
    for w in words:
        if w not in stop_words:
            outFile.write(w)
outFile.close()

Should I be using some kind of line tokenizer instead of word tokenization? I checked the NLTK documentation but I can't make sense of it (I'm still new to this).

2 answers:

Answer 0 (score: 2)

If you want to keep the line structure, just read the file line by line and write a newline after each processed line:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    for line in inFile:
        words = word_tokenize(line)
        for w in words:
            if w not in stop_words:
                outFile.write(w + ' ')  # add a space so the kept words stay separated
        outFile.write('\n')

Answer 1 (score: 2)

I suggest reading the file line by line. Something like this should work:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    for line in inFile:  # iterating the file object yields one line at a time
        words = word_tokenize(line)
        filtered_words = " ".join(w for w in words if w not in stop_words)
        outFile.write(filtered_words + '\n')
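The per-line filtering idiom can be checked without NLTK by substituting a hardcoded stop-word set and a plain `str.split()` for the tokenizer (a minimal sketch; `simple_stops` and `filter_line` are made-up names for illustration):

```python
# Minimal sketch of the per-line filtering idiom, using a hardcoded
# stop-word set and str.split() in place of NLTK's resources.
simple_stops = {"the", "is", "a", "of"}

def filter_line(line, stops=simple_stops):
    # Keep only tokens that are not stop words, re-joined with spaces.
    return " ".join(w for w in line.split() if w not in stops)

print(filter_line("the cat is on a mat"))  # cat on mat
```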

Since the with-statement handles closing for you, there is no need to call outFile.close() afterwards.
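A quick check confirms that a with-statement closes its file as soon as the block exits (a small demo; the file path is arbitrary):

```python
import os
import tempfile

# The with-statement closes the file automatically when the block exits,
# so an explicit close() call afterwards is redundant.
path = os.path.join(tempfile.gettempdir(), "with_demo.txt")
with open(path, "w", encoding="utf8") as f:
    f.write("hello\n")
print(f.closed)  # True
os.remove(path)
```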