How do I remove English words from a file containing Dari words?

Time: 2018-02-06 15:28:26

Tags: python python-3.x python-2.7 nlp stanford-nlp

How can I find English words and remove them from a file that contains Dari words? I tried the code below, but I don't know how to complete it.

inp = open('Dari.pos', 'r')
out = open('DariNER.txt', 'w')

for line in iter(inp):
   ------------?
   out.write(word)
inp.close()
out.close()

2 answers:

Answer 0 (score: 0)

You can install and use the nltk library. It gives you a list of English words as well as a way to split each line into words:

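from nltk.tokenize import word_tokenize
from nltk.corpus import words

# Build a lowercase set once: set lookups are fast, and the corpus
# contains capitalized entries that word.lower() would never match.
english = set(w.lower() for w in words.words())

with open('Dari.pos') as f_input, open('DariNER.txt', 'w') as f_output:
    for line in f_input:
        # Keep only the tokens that are not in the English word list.
        kept = [word for word in word_tokenize(line) if word.lower() not in english]
        f_output.write(' '.join(kept) + '\n')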

After installing nltk, you should run:

import nltk
nltk.download()

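and use it to download the words corpus. Calling nltk.download('words') fetches that corpus directly, without opening the interactive downloader. word_tokenize additionally needs the punkt tokenizer data.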
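If a dictionary lookup is more than you need, a script-based filter is a possible alternative. The sketch below is not the approach from this answer: it assumes that every English word in the file is written in ASCII Latin letters while Dari words use the Arabic script, and that tokens are separated by whitespace. The names `strip_latin_words` and `clean_file` are hypothetical helpers introduced for illustration.

```python
import re
import string

# Assumption: English words consist of ASCII letters (optionally joined by
# an apostrophe or hyphen); Dari words are written in the Arabic script.
LATIN_WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def strip_latin_words(line):
    """Drop whitespace-separated tokens that are Latin-script words."""
    kept = []
    for token in line.split():
        # Ignore surrounding ASCII punctuation when classifying the token.
        core = token.strip(string.punctuation)
        if not LATIN_WORD.fullmatch(core):
            kept.append(token)
    return " ".join(kept)


def clean_file(src, dst):
    """Copy src to dst line by line, removing Latin-script words."""
    with open(src, encoding="utf-8") as f_in, \
            open(dst, "w", encoding="utf-8") as f_out:
        for line in f_in:
            f_out.write(strip_latin_words(line) + "\n")
```

For example, `clean_file('Dari.pos', 'DariNER.txt')` would write the filtered lines. Note that this also removes any Latin-script POS tags that stand as separate tokens, and it needs Python 3 (`re.fullmatch` and the `encoding` argument of `open`).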

Answer 1 (score: 0)

infile = "Dari.pos"
outfile = "Cleaned_English_Tags.txt"

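# POS tags to strip. str.replace matches substrings, so 'X' also hits the
# X in 'AUX', which the trailing 'AU' entry then removes.
delete_list = ['NOUN', 'ADJ', 'PUNCT', 'INTJ', 'ADV', 'VERB', 'X', 'CCONJ', 'ADP', 'AUX', 'SCONJ', 'PRON', 'DET', 'NUM', 'AU']

# The with statement closes both files even if an error occurs.
with open(infile) as fin, open(outfile, 'w') as fout:
    for line in fin:
        for word in delete_list:
            # Replace every occurrence of the tag with a space.
            line = line.replace(word, " ")
        fout.write(line)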
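Because `str.replace` works on substrings, the loop above would also cut tag letters out of any other Latin-script token (for example the `X` in `XML`). A whole-token comparison avoids that. The sketch below assumes each tag appears as its own whitespace-separated token; the actual layout of Dari.pos is not shown in the question, so adjust the splitting if tags are attached to words with a separator such as `_` or `/`. The helper name `strip_tags` is hypothetical.

```python
# Same tag inventory as above, minus the 'AU' workaround, which is only
# needed for substring replacement.
TAGS = {'NOUN', 'ADJ', 'PUNCT', 'INTJ', 'ADV', 'VERB', 'X', 'CCONJ',
        'ADP', 'AUX', 'SCONJ', 'PRON', 'DET', 'NUM'}


def strip_tags(line):
    """Remove tokens that are exactly one of the POS tags."""
    return " ".join(tok for tok in line.split() if tok not in TAGS)
```

Writing `strip_tags(line) + "\n"` in place of the replaced line keeps the rest of the loop unchanged.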