I'm trying to remove stopwords from a text file. The file consists of over 9,000 sentences, each on its own line.
The code seems to work reasonably well, but I'm obviously missing something, because the output file has lost the line structure of the original document, which I of course want to keep.
Here is the code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(inFile.read())
    for w in words:
        if w not in stop_words:
            outFile.write(w)
    outFile.close()
Should I use some kind of line tokenizer instead of word_tokenize? I looked at the NLTK documentation but can't make much sense of it (I'm still new to all of this).
Answer 0 (score: 2)
If you want to keep the line structure, just read the file line by line and write a newline after each one:
with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
stop_words = set(stopwords.words('english'))
for line in infile:
words = word_tokenize(line)
for w in words:
if w not in stop_words:
outFile.write(w)
output.write('\n')
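One caveat to watch for: NLTK's English stopword list is all lowercase, so a plain membership test lets capitalized tokens such as "The" slip through. A minimal sketch of a case-insensitive filter (remove_stopwords is just an illustrative name, not from the code above):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def remove_stopwords(line):
    # Compare lowercased tokens against the (lowercase) stopword list,
    # but keep the original casing of the tokens that survive.
    words = word_tokenize(line)
    return ' '.join(w for w in words if w.lower() not in stop_words)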
Answer 1 (score: 2)
I suggest reading the file line by line. Something like this should work:
with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    for line in inFile.readlines():
        words = word_tokenize(line)
        filtered_words = " ".join(w for w in words if w not in stop_words)
        outFile.write(filtered_words + '\n')
By the way, you don't have to call close() on the files yourself: if the with statement works as expected, it closes them automatically when the block ends.
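Note that word_tokenize splits punctuation into separate tokens, so joining with spaces yields text like "hello , world". If that matters, NLTK's TreebankWordDetokenizer can reassemble the kept tokens with more natural spacing; a minimal sketch along the lines of the answer above (filter_line is an illustrative name):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

stop_words = set(stopwords.words('english'))
detok = TreebankWordDetokenizer()

def filter_line(line):
    # Drop stopwords (case-insensitively), then let the detokenizer
    # fix the spacing around punctuation when rebuilding the line.
    kept = [w for w in word_tokenize(line) if w.lower() not in stop_words]
    return detok.detokenize(kept)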