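Error creating vectors from a text file in Python

Time: 2015-12-02 20:54:25

Tags: python nlp

I want to import data from a text file and build a vector-space representation of its words: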

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input="file")
f = open('D:\\test\\17.txt')
bag_of_words = vectorizer.fit(f)
bag_of_words = vectorizer.transform(f)
print(bag_of_words)

But I get this error:

Traceback (most recent call last):
  File "D:\test\test.py", line 5, in <module>
    bag_of_words = vectorizer.fit(f)
  File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 776, in fit
    self.fit_transform(raw_documents)
  File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
    for feature in analyze(doc):
  File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 110, in decode
    doc = doc.read()
AttributeError: 'str' object has no attribute 'read'

Any ideas?

1 answer:

Answer 0: (score: 0)

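The vectorizer.fit method expects an iterable of documents (file objects when input="file", otherwise strings), not a single file object, so you should call vectorizer.fit([f]). Iterating over a single file object yields its lines as strings, which is why decode fails with 'str' object has no attribute 'read'.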
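To see why the traceback mentions a 'str', here is a minimal sketch using only the standard library. It assumes io.StringIO as a stand-in for the opened file, and the sample text is made up. It compares what an iterating consumer receives from f with what it receives from [f]:

```python
import io

# io.StringIO stands in for the opened text file; the content is made up.
f = io.StringIO("first line\nsecond line\n")

# Iterating over the file object itself yields its lines as str objects.
# This is what a consumer sees when it is handed f directly.
docs_from_file = list(f)

# Wrapping the file in a list yields the file object itself, which has .read().
g = io.StringIO("first line\nsecond line\n")
docs_from_list = list([g])

print([type(d).__name__ for d in docs_from_file])
print([hasattr(d, "read") for d in docs_from_file])
print([hasattr(d, "read") for d in docs_from_list])
```

Each element of docs_from_file is a plain string with no read method, which matches the AttributeError in the question.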

Also, you can't reuse f in the second call, vectorizer.transform, because the file has already been read by that point. What you probably want is the following:
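The "already read" behaviour can be reproduced without scikit-learn. This is a small sketch, again assuming io.StringIO as a stand-in for the real file. Files returned by open() behave the same way for read and seek:

```python
import io

# Stand-in for the opened file; the text is made up.
f = io.StringIO("some text in the file")

first_read = f.read()    # consumes the whole stream
second_read = f.read()   # nothing left, so this is an empty string

f.seek(0)                # rewind to the beginning
after_rewind = f.read()  # the full content is available again

print(repr(first_read), repr(second_read), repr(after_rewind))
```

If separate fit and transform calls are really needed, rewinding with f.seek(0) between them is one option. A single fit_transform call avoids the issue entirely.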

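from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input="file")
# Pass a list of file objects; fit_transform reads each file exactly once.
with open('D:\\test\\17.txt') as f:
    bag_of_words = vectorizer.fit_transform([f])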
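If separate fit and transform calls are preferred, one possible alternative (not part of the original answer) is to let CountVectorizer open the file itself with input="filename", so nothing is left half-read between calls. This sketch uses a temporary file with made-up text in place of D:\test\17.txt:

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample file standing in for D:\test\17.txt.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the cat sat on the mat")
    path = tmp.name

try:
    # With input="filename" the vectorizer opens and reads each path itself,
    # so the same list of paths can be passed to fit and then to transform.
    vectorizer = CountVectorizer(input="filename")
    vectorizer.fit([path])
    bag_of_words = vectorizer.transform([path])

    # Map each vocabulary word to its count in the single document.
    counts = {
        word: int(bag_of_words[0, idx])
        for word, idx in vectorizer.vocabulary_.items()
    }
finally:
    os.remove(path)

print(counts)
```

The result is a sparse matrix with one row per document. The counts dictionary is only there to make the result easier to inspect.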