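I have a file containing stop words (one per line) and another file (actually a corpus) made up of many sentences, each on its own line. I need to remove the stop words from the corpus and return each line without them. I wrote some code, but it only returns a single sentence. (The language is Persian.) How can I fix it so that it returns all the sentences?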
with open ("stopwords.txt", encoding = "utf-8") as f1:
    with open ("train.txt", encoding = "utf-8") as f2:
        for i in f1:
            for line in f2:
                if i in line:
                    line= line.replace(i, "")
with open ("NoStopWordsTrain.txt", "w", encoding = "utf-8") as f3:
    f3.write (line)
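Answer 0 (score: 0)
The problem is that your last two lines of code are not inside the for loop. You iterate over all of f2 line by line and do nothing with the result. Then, once the loop has finished, you write only the last line to f3. Instead, try: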
with open("stopwords.txt", encoding="utf-8") as stopfile:
    # strip the trailing newline from each stop word and skip blank lines
    stopwords = [word.strip() for word in stopfile if word.strip()]
print(stopwords)  # just to check that the words were read
with open("train.txt", encoding="utf-8") as trainfile:
    with open("NoStopWordsTrain.txt", "w", encoding="utf-8") as newfile:
        for line in trainfile:  # go through each line
            for word in stopwords:  # go through and replace each word
                line = line.replace(word, "")
            newfile.write(line)  # write every processed line, inside the loop
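One caveat with `str.replace`: it removes matching substrings, so a short stop word is also deleted from inside longer words. Below is a minimal sketch of a whole-word variant, assuming tokens are separated by whitespace (this may not hold for every Persian text, for example where a zero-width non-joiner is involved). The names `remove_stopwords` and `filter_file` are hypothetical helpers, not part of the original answer.

```python
def remove_stopwords(line, stopwords):
    """Return `line` without the whitespace-separated tokens found in `stopwords`.

    Note: str.split() also collapses runs of whitespace and drops the
    trailing newline, so the result is re-joined with single spaces.
    """
    kept = [token for token in line.split() if token not in stopwords]
    return " ".join(kept)


def filter_file(stop_path, in_path, out_path):
    # Hypothetical helper: all three paths are supplied by the caller.
    with open(stop_path, encoding="utf-8") as stopfile:
        # A set gives fast membership tests; blank lines are skipped.
        stopwords = {word.strip() for word in stopfile if word.strip()}
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # One output line per input line, even if it ends up empty.
            dst.write(remove_stopwords(line, stopwords) + "\n")
```

With the file names from the question, this could be called as `filter_file("stopwords.txt", "train.txt", "NoStopWordsTrain.txt")`.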
Answer 1 (score: 0)
You can iterate over both files and then write to a third. @Noam is right that the indentation around your final file write is the problem.
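with open("stopwords.txt", encoding="utf-8") as sw, open("train.txt", encoding="utf-8") as train, open("NoStopWordsTrain.txt", "w", encoding="utf-8") as no_sw:
    # strip newlines so the comparison does not depend on line endings
    stopwords = {word.strip() for word in sw if word.strip()}
    # lines read from a file already end in "\n", so none is appended here
    no_sw.writelines(line for line in train if line.strip() not in stopwords)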
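This basically writes out every line of train.txt, skipping any line that is itself one of the stop words. Note that it only drops whole lines that equal a stop word; it does not remove stop words from inside a sentence.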
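To make the behavior of this line-level filter concrete, here is a small in-memory sketch. The sample data is made up for illustration, and `io.StringIO` stands in for the real files.

```python
import io

# Made-up sample data standing in for stopwords.txt and train.txt.
stopword_file = io.StringIO("and\nthe\n")
train_file = io.StringIO("the\nthe cat and the hat\nand\n")

stopwords = {word.strip() for word in stopword_file if word.strip()}

# Keep a line unless the whole line is a stop word.
kept = [line for line in train_file if line.strip() not in stopwords]
print(kept)
```

Only the lines consisting of a single stop word are dropped; the sentence in the middle is kept unchanged, stop words included.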
If you think the with open(... line is too long, you can use Python's functools.partial function to set default arguments.
from functools import partial
utfopen = partial(open, encoding="utf-8")
with utfopen("stopwords.txt") as sw, utfopen("train.txt") as train, utfopen("NoStopWordsTrain.txt", "w") as no_sw:
    # Rest of your code here
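As a quick check that the `partial` wrapper behaves like `open` with the encoding pre-filled, here is a minimal sketch. It writes to a temporary directory so it does not touch the real files, and the file name `sample.txt` is only an illustration.

```python
import os
import tempfile
from functools import partial

# open() with encoding="utf-8" already filled in; other arguments pass through.
utfopen = partial(open, encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with utfopen(path, "w") as out:  # same as open(path, "w", encoding="utf-8")
        out.write("سلام\n")
    with utfopen(path) as src:
        text = src.read()
```

Positional arguments such as the mode are still passed as usual, so `utfopen(path, "w")` opens the file for writing.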