I am new to Python. I am trying to read a CSV file, remove the stop words from it, and store the result in a new CSV file. My code does remove stop words, but it copies the first line of the file into every line of the output. (For example, if the file has three lines, it writes the first line three times.)
As far as I can tell, the problem is in the loop, but I can't pin it down. My code is attached below.
Code:
    import nltk
    import csv
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    def stop_Words(fileName, fileName_out):
        file_out = open(fileName_out, 'w')
        with open(fileName, 'r') as myfile:
            line = myfile.readline()
            stop_words = set(stopwords.words("english"))
            words = word_tokenize(line)
            filtered_sentence = [" "]
            for w in myfile:
                for n in words:
                    if n not in stop_words:
                        filtered_sentence.append(' ' + n)
                file_out.writelines(filtered_sentence)
        print "All Done SW"

    stop_Words("A_Nehra_updated.csv", "A_Nehra_final.csv")
    print "all done :)"
Answer 0 (score: 2)
You are only reading the first line of the file: line=myfile.readline(). You want to iterate over every line in the file. One way to do that is:
    with open(fileName, 'r') as myfile:
        for line in myfile:
            # the rest of your code here, i.e.:
            stop_words = set(stopwords.words("english"))
            words = word_tokenize(line)
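The difference is easy to see with a small standalone sketch (Python 3 syntax, with io.StringIO standing in for a real file): readline() returns a single line, while iterating over the file object visits them all.

```python
import io

# A file-like object with three lines, standing in for the CSV file.
myfile = io.StringIO("first line\nsecond line\nthird line\n")

print(myfile.readline())   # readline() returns only the first line

myfile.seek(0)             # rewind to the beginning
for line in myfile:        # iterating visits every line in turn
    print(line.strip())
```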
Separately, you have this loop:
    for w in myfile:
        for n in words:
            if n not in stop_words:
                filtered_sentence.append(' ' + n)
but notice that w, defined by the outermost loop, is never used inside it. You should be able to remove that loop and just write:
    for n in words:
        if n not in stop_words:
            filtered_sentence.append(' ' + n)
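The filtering step can also be checked in isolation. A minimal sketch (Python 3 syntax), with a hand-made stop-word set standing in for stopwords.words("english") and str.split standing in for word_tokenize so it runs without NLTK:

```python
# Hypothetical miniature stop-word set; the real code uses NLTK's English list.
stop_words = {"the", "is", "a", "of"}

line = "the cat is on a mat"
words = line.split()  # stand-in for word_tokenize(line)

filtered_sentence = []
for n in words:
    if n not in stop_words:
        filtered_sentence.append(' ' + n)

print(''.join(filtered_sentence))  # -> " cat on mat"
```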
Edit:
    import nltk
    import csv
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    def stop_Words(fileName, fileName_out):
        file_out = open(fileName_out, 'w')
        stop_words = set(stopwords.words("english"))  # build the set once, not per line
        with open(fileName, 'r') as myfile:
            for line in myfile:
                words = word_tokenize(line)
                filtered_sentence = []
                for n in words:
                    if n not in stop_words:
                        filtered_sentence.append(' ' + n)
                file_out.writelines(filtered_sentence + ["\n"])
        file_out.close()  # flush and close the output file
        print "All Done SW"
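For comparison, the same structure can be sketched self-contained in Python 3, with a with block managing both files and a plain stop-word set in place of NLTK's (so it runs without the corpus download). The file names and the stop-word set here are hypothetical; swap in set(stopwords.words("english")) and word_tokenize for the real thing.

```python
def stop_words_filter(file_in, file_out_path, stop_words):
    """Copy file_in to file_out_path line by line, dropping stop words."""
    with open(file_in) as src, open(file_out_path, 'w') as dst:
        for line in src:
            words = line.split()  # stand-in for nltk's word_tokenize
            kept = [w for w in words if w not in stop_words]
            dst.write(' '.join(kept) + '\n')

# Tiny demonstration with a hand-made stop-word set and hypothetical file names.
with open("demo_in.csv", 'w') as f:
    f.write("the cat is here\n")
stop_words_filter("demo_in.csv", "demo_out.csv", {"the", "is", "a"})
with open("demo_out.csv") as f:
    print(f.read())  # -> "cat here\n"
```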