from sklearn.feature_extraction.text import TfidfVectorizer

filename = 'train1.txt'
dataset = []
with open(filename) as f:
    for line in f:
        dataset.append([str(n) for n in line.strip().split(',')])
print(dataset)

tfidf = TfidfVectorizer()
tfidf.fit(dataset)
dict1 = tfidf.vocabulary_

print 'Using tfidfVectorizer'
for key in dict1.keys():
    print key + " " + str(dict1[key])
I am reading strings from the file train1.txt. However, the call tfidf.fit(dataset) raises an error that I have not been able to resolve. Any help would be appreciated.

Error log:
Traceback (most recent call last):
  File "Q1.py", line 52, in <module>
    tfidf.fit(dataset)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1361, in fit
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
Answer (score: 1)
According to the docs for TfidfVectorizer, fit expects "an iterable which yields either str, unicode or file objects" as its first argument. The list of lists you are passing does not satisfy this requirement.

You have already turned each line into a list of strings with split, so you either need to join those strings back together or avoid splitting the line in the first place. Which option is right depends, of course, on your input format.
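For reference, here is a minimal sketch of input that fit does accept; the two sample documents are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the first document", "the second document"]  # hypothetical documents, one plain string each
tfidf = TfidfVectorizer()
tfidf.fit(docs)            # works: the iterable yields strings, not lists of tokens
print(tfidf.vocabulary_)   # e.g. {'document': 0, 'first': 1, 'second': 2, 'the': 3}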
It will work if you modify this line:

dataset.append([str(n) for n in line.strip().split(',')])

Depending on your input format, you might replace it with something like

dataset.append(" ".join([str(n) for n in line.strip().split(',')]))

or simply

dataset.append(line.strip().replace(",", " "))

(I can only guess at how "," is used in your input text).
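Putting it together, a sketch of the corrected script, assuming each line of train1.txt really is a list of comma-separated words and that replacing "," with a space is what you want:

from sklearn.feature_extraction.text import TfidfVectorizer

filename = 'train1.txt'
dataset = []
with open(filename) as f:
    for line in f:
        # keep each line as one document string rather than a list of tokens
        dataset.append(line.strip().replace(",", " "))

tfidf = TfidfVectorizer()
tfidf.fit(dataset)                     # now receives an iterable of strings
for word, index in tfidf.vocabulary_.items():
    print(word + " " + str(index))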