I want to convert a sparse matrix (156060x11780) to a DataFrame, but I get a MemoryError. Here is my code:
vect = TfidfVectorizer(sublinear_tf=True, analyzer='word',
                       stop_words='english', tokenizer=tokenize,
                       strip_accents='ascii')
X = vect.fit_transform(df.pop('Phrase')).toarray()
for i, col in enumerate(vect.get_feature_names()):
    df[col] = X[:, i]
The problem occurs at X = vect.fit_transform(df.pop('Phrase')).toarray().
How can I solve it?
Answer 0: (score: 3)
Try this:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(sublinear_tf=True, analyzer='word', stop_words='english',
                       tokenizer=tokenize,
                       strip_accents='ascii', dtype=np.float16)
X = vect.fit_transform(df.pop('Phrase'))  # NOTE: `.toarray()` was removed
# Densify one column at a time and store it sparsely.
# pd.SparseSeries only exists in pandas < 1.0.
for i, col in enumerate(vect.get_feature_names()):
    df[col] = pd.SparseSeries(X[:, i].toarray().reshape(-1,), fill_value=0)
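To see why dropping `.toarray()` matters, here is a rough back-of-the-envelope estimate of what a fully dense array of the question's shape would need. The helper name `dense_bytes` is made up for this illustration, and the figures ignore any overhead from pandas itself.

```python
def dense_bytes(n_rows, n_cols, itemsize):
    """Bytes needed for a dense n_rows x n_cols array with the given element size."""
    return n_rows * n_cols * itemsize


rows, cols = 156060, 11780  # shape from the question

float64_bytes = dense_bytes(rows, cols, 8)  # float64 is TfidfVectorizer's default dtype
float16_bytes = dense_bytes(rows, cols, 2)  # dtype used in the answer above

print(f"float64: {float64_bytes / 2**30:.1f} GiB")  # float64: 13.7 GiB
print(f"float16: {float16_bytes / 2**30:.1f} GiB")  # float16: 3.4 GiB
```

Even at float16 the dense array would be several GiB, so keeping the data sparse is what actually avoids the MemoryError.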
UPDATE: with Pandas 0.20+ we can build a SparseDataFrame directly from the sparse array:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(sublinear_tf=True, analyzer='word', stop_words='english',
                       tokenizer=tokenize,
                       strip_accents='ascii', dtype=np.float16)
# pd.SparseDataFrame only exists in pandas < 1.0.
df = pd.SparseDataFrame(vect.fit_transform(df.pop('Phrase')),
                        columns=vect.get_feature_names(),
                        index=df.index)