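How can I create a sparse matrix in CSR/COO format for a huge feature matrix (50000 x 100000) from categorical data stored in a Pandas DataFrame? I am using the Pandas get_dummies() function to create the feature vectors, but it raises a MemoryError. How can I avoid this and generate the result directly as a sparse matrix in CSR format?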
Answer 0 (score: 0)
Links that may be useful:
Populate a Pandas SparseDataFrame from a SciPy Sparse Matrix
http://pandas.pydata.org/pandas-docs/stable/sparse.html
Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory
Answer 1 (score: 0)
Use:
scipy.sparse.coo_matrix(df_dummies)
But don't forget to create df_dummies as sparse first:
df_dummies = pandas.get_dummies(df, sparse=True)
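Depending on the pandas version, passing a DataFrame straight to coo_matrix may convert it to a dense array first, which would defeat the purpose. The sketch below shows one way to avoid that, assuming pandas 0.25 or later, where the DataFrame.sparse accessor is available. The toy DataFrame and its column names are made up for illustration, and only the categorical column is encoded so that every resulting column has a sparse dtype.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real 50000-row frame (hypothetical values).
df = pd.DataFrame({'rowid': [1, 2, 3, 4, 5],
                   'category': ['c1', 'c2', 'c1', 'c3', 'c1']})

# Encode only the categorical column so that every resulting column is sparse.
# dtype=np.uint8 keeps the fill value at integer 0.
df_dummies = pd.get_dummies(df['category'], sparse=True, dtype=np.uint8)

# Convert through the .sparse accessor instead of coo_matrix(df_dummies).
coo = df_dummies.sparse.to_coo()
csr = coo.tocsr()

print(csr.shape)  # (rows, number of distinct categories)
print(csr.nnz)    # number of stored non-zero entries
```

If the frame also has non-categorical columns, they would need to be converted separately and stacked on afterwards, since the accessor only works when all columns are sparse.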
Answer 2 (score: 0)
This answer keeps the data sparse wherever possible, which avoids the memory problems you hit with Pandas get_dummies.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from scipy import sparse
df = pd.DataFrame({'rowid':[1,2,3,4,5], 'category':['c1', 'c2', 'c1', 'c3', 'c1']})
print('Input data frame\n{0}'.format(df))
print('Encode column category as numerical variables')
print(LabelEncoder().fit_transform(df.category))
print('Encode column category as dummy matrix')
# .todense() is only for displaying the toy example; skip it on the real data
print(OneHotEncoder().fit_transform(LabelEncoder().fit_transform(df.category).reshape(-1,1)).todense())
print('Concat with the original data frame as a matrix')
# OneHotEncoder returns a SciPy sparse matrix by default
dummy_matrix = OneHotEncoder().fit_transform(LabelEncoder().fit_transform(df.category).reshape(-1,1))
# as_matrix() was removed in pandas 1.0; to_numpy() replaces it
df_as_sparse = sparse.csr_matrix(df.drop(labels=['category'], axis=1).to_numpy())
sparse_combined = sparse.hstack((df_as_sparse, dummy_matrix), format='csr')
print(sparse_combined.todense())
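The same result can also be built without scikit-learn, by constructing the CSR matrix directly from integer category codes. This is a minimal sketch rather than the method used above: it assumes the category column has no missing values (pd.factorize would assign them the code -1), and it reuses the toy DataFrame from this answer.

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({'rowid': [1, 2, 3, 4, 5],
                   'category': ['c1', 'c2', 'c1', 'c3', 'c1']})

# Map each distinct label to an integer code; sort=True gives a stable column order.
# Assumption: no missing values in the column (they would get code -1).
codes, labels = pd.factorize(df['category'], sort=True)

# One non-zero entry per row, placed in the column given by that row's code.
rows = np.arange(len(df))
data = np.ones(len(df), dtype=np.uint8)
dummy_matrix = sparse.csr_matrix((data, (rows, codes)),
                                 shape=(len(df), len(labels)))

# Stack the remaining numeric column(s) to the left of the dummy columns.
numeric = sparse.csr_matrix(df[['rowid']].to_numpy())
sparse_combined = sparse.hstack((numeric, dummy_matrix), format='csr')

# toarray() is only for inspecting the toy example; avoid it on the real data.
print(sparse_combined.toarray())
```

Only one value is stored per row and per categorical column, so memory grows with the number of rows rather than with rows times categories.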