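I have a csv file with 10 rows of text in a single column. For each row, I want to remove the stop words and get back the same csv file, just without the stop words.
This is my code: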
def remove_stopwords(filename):
    new_text_list=[]
    cr = csv.reader(open(filename,"rU").readlines()[1:])
    cachedStopWords = stopwords.words("english")
    for row in cr:
        text = ' '.join([word for word in row.split() if word not in cachedStopWords])
        print text
        new_text_list.append(text)
But I keep getting this error:
AttributeError: 'list' object has no attribute 'split'
So it seems the rows from my csv file cannot be split with .split because each one is a list. How can I get around this?
Here is what my csv file looks like:
Text
I am very pleased with the your software for contractors. It is tailored quite neatly for the construction industry.
We have two different companies, one is real estate management and one is health and nutrition services. It works great for both.
So the example above is the first 3 lines of my csv file. When I run this code:
cr = csv.reader(open(filename,"rU").readlines()[1:])
print cr[2]
I get:
['We have two different companies, one is real estate management and one is health and nutrition services. It works great for both.']
Thanks,
Answer 0 (score: 2)
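Your data file is not a CSV: the words are separated by spaces, not commas. So you don't need the csv module. Instead, just read each line from the file and use row = line.split() to split the line on whitespace.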
from nltk.corpus import stopwords

def remove_stopwords(filename):
    new_text_list = []
    # A set makes the membership test below fast.
    cachedStopWords = set(stopwords.words("english"))
    with open(filename, "r") as f:
        next(f)  # skip the header line
        for line in f:
            row = line.split()
            text = ' '.join([word for word in row
                             if word not in cachedStopWords])
            print(text)
            new_text_list.append(text)
    return new_text_list
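To get the cleaned text back into a file with the same one-column layout, the filtered lines could be written out with csv.writer. The following is only a sketch under a few assumptions: SAMPLE_STOPWORDS is a small hand-picked stand-in for set(stopwords.words("english")) so the example runs without NLTK, the names strip_stopwords and write_cleaned_csv are hypothetical, and words are compared in lowercase (which the code above does not do) so that "It" or "I" are also removed. It targets Python 3.

```python
import csv

# Assumption: tiny stand-in for set(stopwords.words("english")).
SAMPLE_STOPWORDS = {"i", "am", "with", "the", "for", "it", "is", "a", "and"}


def strip_stopwords(line, stop_words):
    """Return the line with every word found in stop_words removed."""
    # Compare in lowercase so that "It" matches the stop word "it".
    return " ".join(w for w in line.split() if w.lower() not in stop_words)


def write_cleaned_csv(in_path, out_path, stop_words=SAMPLE_STOPWORDS):
    """Copy the header row, then write each line with stop words removed."""
    with open(in_path, "r") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        # The first line is the column header ("Text" in the question).
        writer.writerow([next(src).strip()])
        for line in src:
            # One field per row; csv.writer quotes fields containing commas.
            writer.writerow([strip_stopwords(line, stop_words)])
```

Calling write_cleaned_csv("input.csv", "output.csv") (both paths are placeholders) would produce a one-column file with the same header. Because csv.writer quotes any field that contains a comma, sentences with commas stay in a single column when the file is read back.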
By the way, checking membership in a set is an O(1) operation on average, whereas checking membership in a list is an O(n) operation. It is therefore advantageous to make cachedStopWords a set:
cachedStopWords = set(stopwords.words("english"))
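The difference can be seen with a rough timing comparison. This is only an illustrative sketch: words_list is a made-up list of 1000 strings standing in for the stop-word list, and the actual timings will vary by machine.

```python
import timeit

# Assumption: made-up word list standing in for stopwords.words("english").
words_list = ["word%d" % i for i in range(1000)]
words_set = set(words_list)

# Look up the last element: the worst case for a linear list scan.
probe = "word999"

# Time 10000 membership tests against each container.
list_time = timeit.timeit(lambda: probe in words_list, number=10000)
set_time = timeit.timeit(lambda: probe in words_set, number=10000)

print("list: %.4f s, set: %.4f s" % (list_time, set_time))
```

The list lookup has to compare the probe against each element in turn, while the set lookup hashes the probe and checks only the matching bucket, so the gap grows as the word list gets longer.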