I'm trying to remove stopwords from a text file. The file consists of over 9,000 sentences, each on its own line.
The code seems to work reasonably well, but I'm obviously missing something, because the output file has lost the line structure of the original document, which I of course want to keep.
Here is the code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(inFile.read())
    for w in words:
        if w not in stop_words:
            outFile.write(w)
    outFile.close()
Should I use some kind of line tokenizer instead of word_tokenize? I looked at the NLTK documentation but can't make much sense of it (I'm still new to all of this).
Answer 0 (score: 2)
If you want to keep the line structure, just read the file line by line and write a newline after each one:
with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
stop_words = set(stopwords.words('english'))
for line in infile:
words = word_tokenize(line)
for w in words:
if w not in stop_words:
outFile.write(w)
output.write('\n')
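One caveat to watch for: NLTK's English stopword list is all lowercase, so a plain membership test lets capitalized tokens such as "The" slip through. A minimal sketch of a case-insensitive filter (remove_stopwords is just an illustrative name, not from the code above):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def remove_stopwords(line):
    # Compare lowercased tokens against the (lowercase) stopword list,
    # but keep the original casing of the tokens that survive.
    words = word_tokenize(line)
    return ' '.join(w for w in words if w.lower() not in stop_words)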
Answer 1 (score: 2)
I suggest reading the file line by line. Something like this should work:
with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    for line in inFile.readlines():
        words = word_tokenize(line)
        filtered_words = " ".join(w for w in words if w not in stop_words)
        outFile.write(filtered_words + '\n')
By the way, you don't have to call close() on the files yourself: if the with statement works as expected, it closes them automatically when the block ends.
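Note that word_tokenize splits punctuation into separate tokens, so joining with spaces yields text like "hello , world". If that matters, NLTK's TreebankWordDetokenizer can reassemble the kept tokens with more natural spacing; a minimal sketch along the lines of the answer above (filter_line is an illustrative name):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

stop_words = set(stopwords.words('english'))
detok = TreebankWordDetokenizer()

def filter_line(line):
    # Drop stopwords (case-insensitively), then let the detokenizer
    # fix the spacing around punctuation when rebuilding the line.
    kept = [w for w in word_tokenize(line) if w.lower() not in stop_words]
    return detok.detokenize(kept)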