I want to use Sklearn to vectorize my data in a big CSV file. I used the following code:
First TRY:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(input='file', stop_words = 'english', ngram_range=(1,2))
vectorizer.fit_transform('test.csv')
But I got this error:
AttributeError: 'str' object has no attribute 'read'
Second TRY, but the error was still raised:
import csv
file = open('test.csv', 'r')
f = file.readline()
vectorizer.fit_transform(f)
Third TRY: This one did work, but the process was killed because it ran out of memory.
from sklearn.feature_extraction.text import TfidfVectorizer
file = open('test.csv', 'r')
a = file.read()
vectorizer = TfidfVectorizer(stop_words = 'english', ngram_range=(1,2))
de = vectorizer.fit_transform(a.split('\n'))
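A lower-memory variant of this third attempt can be sketched, assuming each line of test.csv should be one document: an open file object is itself a lazy iterator over lines, so it can be passed to fit_transform directly, and the full string a (plus the a.split('\n') list) never has to be held in memory.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
with open('test.csv', 'r') as f:
    # The file is streamed line by line; note that the vocabulary and the
    # resulting sparse matrix are still built in memory, so unigram+bigram
    # features over a very large file can still exhaust RAM.
    de = vectorizer.fit_transform(f)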
How can I use fit_transform in Sklearn to process a large CSV file?
Answer 0 (score: 0):
You declare your input to be file, but in both cases you are giving it a string (file.readline() returns the first line of the file as a string).
Instead, give it a file object.
Do the following:
file = open('test.csv', 'r')
vectorizer.fit_transform([file])  # input='file' expects an iterable of file-like objects; the whole file becomes one document
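If memory is the real bottleneck (the unigram+bigram vocabulary of a large file can be huge), a hedged alternative is HashingVectorizer, which hashes features into a fixed-size space and never stores a vocabulary; the file name test.csv and the one-document-per-line assumption are carried over from the question.
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless vectorizer: no vocabulary is kept, so memory use does not grow
# with the number of distinct unigrams/bigrams in the file.
vectorizer = HashingVectorizer(stop_words='english', ngram_range=(1, 2),
                               n_features=2**20)
with open('test.csv', 'r') as f:
    X = vectorizer.fit_transform(f)  # one sparse row per line of the file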