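Consider my Python code below. The file minemaggi.txt contains tweets, and I am trying to remove stop words from them, but in the output file the tweets do not appear on separate lines. I would also like to remove all links from the text file. How can I do that?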
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import codecs
import nltk
stopset = set(stopwords.words('english'))
writeFile = codecs.open("outputfile.txt", "w", encoding='utf-8')
with codecs.open("minemaggi.txt", "r", encoding='utf-8') as f:
    line = f.read()
    new = '\n'
    tokens = nltk.word_tokenize(line)
    tokens = [w for w in tokens if not w in stopset]
    for token in tokens:
        writeFile.write('{}{}'.format(' ', token))
    writeFile.write('{}'.format(new))
Answer 0 (score: 0)
You need to explicitly add a newline character to the string you write to the file, like this:
writeFile.write('{}{}\n'.format(' ', token))
Answer 1 (score: 0)
I would use ' '.join() to rejoin the words and then write one line at a time:
with codecs.open("minemaggi.txt", "r", encoding='utf-8') as f:
    # Loop over all lines in the input file.
    for line in f:
        # As before, remove the stop words.
        tokens = nltk.word_tokenize(line)
        tokens = [w for w in tokens if not w in stopset]
        # Rejoin the words, separated by one space.
        line = ' '.join(tokens)
        # Write the processed line to the output file.
        writeFile.write('{}\n'.format(line))
Hope that helps.
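The question also asks about removing links, which the code above does not cover. One possible approach is to strip them with a regular expression before tokenizing each line. The following is only a rough sketch under some assumptions: URL_PATTERN is a simple pattern I made up that treats anything beginning with http://, https:// or www. as a link, SAMPLE_STOPSET is a small hypothetical stand-in for set(stopwords.words('english')), and str.split() stands in for nltk.word_tokenize so the example runs without NLTK data. The names clean_tweet and clean_lines are also just illustrative.

```python
import re

# Assumption: a link is anything that starts with http://, https:// or www.
# and runs up to the next whitespace. Real tweets may need a stricter pattern.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

# Hypothetical stand-in for set(stopwords.words('english')).
SAMPLE_STOPSET = {'the', 'is', 'a', 'at', 'this', 'and', 'i'}


def clean_tweet(line, stopset=SAMPLE_STOPSET):
    """Remove links and stop words from one tweet and return it as one line."""
    # Drop the links first so their pieces never reach the tokenizer.
    without_links = URL_PATTERN.sub('', line)
    # Assumption: str.split() replaces nltk.word_tokenize for this sketch,
    # and the comparison is lowercased, unlike the code in the question.
    tokens = [w for w in without_links.split() if w.lower() not in stopset]
    return ' '.join(tokens)


def clean_lines(lines, stopset=SAMPLE_STOPSET):
    """Clean every tweet, keeping one output entry per input line."""
    return [clean_tweet(line, stopset) for line in lines]


if __name__ == '__main__':
    sample = ["I love this noodle http://t.co/abc123 and the soup"]
    for cleaned in clean_lines(sample):
        print(cleaned)
```

In the loop from the answer above, the same idea would amount to calling URL_PATTERN.sub('', line) on each line before passing it to nltk.word_tokenize, and leaving the rest unchanged.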