This error is strange; I can't even find anything about it on Google.
I'm trying to one-hot encode a column in an existing sparse DataFrame, where
combined_cats
is the set of all possible categories and column_name
is a generic column name.
df[column_name] = df[column_name].astype('category', categories=combined_cats,copy=False)
However, this fails with the error in the title. I assumed you can't one-hot encode a sparse matrix, but I can't seem to convert it back to dense via to_dense() either, because it says a numpy ndarray has no such method.
I tried using as_matrix() and resetting the column:
df[column_name] = df[column_name].as_matrix()
df[column_name] = df[column_name].astype('category', categories=combined_cats,copy=False)
That didn't work either. Is there something I'm doing wrong? The error occurs when I try to use combined_cats.
For example:
def hot_encode_column_in_both_datasets(column_name, df, df2, sparse=True):
    col1b = set(df2[column_name].unique())
    col1a = set(df[column_name].unique())
    combined_cats = list(col1a.union(col1b))
    df[column_name] = df[column_name].astype('category', categories=combined_cats, copy=False)
    df2[column_name] = df2[column_name].astype('category', categories=combined_cats, copy=False)

    df = pd.get_dummies(df, columns=[column_name], sparse=sparse)
    df2 = pd.get_dummies(df2, columns=[column_name], sparse=sparse)

    try:
        del df[column_name]
        del df2[column_name]
    except:
        pass

    return df, df2
df = pd.DataFrame({"col1":['a','b','c','d'],"col2":["potato","tomato","potato","tomato"],"col3":[1,1,1,1]})
df2 = pd.DataFrame({"col1":['g','b','q','r'],"col2":["potato","flowers","potato","flowers"],"col3":[1,1,1,1]})
## Hot encode col1
df,df2 = hot_encode_column_in_both_datasets("col1",df,df2)
len(df.columns) #9
len(df2.columns) #9
## Hot encode col2 as well
df,df2 = hot_encode_column_in_both_datasets("col2",df,df2)
Traceback (most recent call last):
File "<ipython-input-44-d8e27874a25b>", line 1, in <module>
df,df2 = hot_encode_column_in_both_datasets("col2",df,df2)
File "<ipython-input-34-5ae1e71bbbd5>", line 331, in hot_encode_column_in_both_datasets
df[column_name] = df[column_name].astype('category', categories=combined_cats,copy=False)
File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2419, in __setitem__
self._set_item(key, value)
File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2485, in _set_item
value = self._sanitize_column(key, value)
File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/frame.py", line 324, in _sanitize_column
clean = value.reindex(self.index).as_sparse_array(
File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/series.py", line 573, in reindex
return self.copy()
File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/series.py", line 555, in copy
return self._constructor(new_data, sparse_index=self.sp_index,
File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2744, in __getattr__
return object.__getattribute__(self, name)
File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/series.py", line 242, in sp_index
return self.block.sp_index
AttributeError: 'CategoricalBlock' object has no attribute 'sp_index'
Answer (score: 2)

As I said before, I would use the CountVectorizer method in this case.
Demo:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=np.union1d(df.col2, df2.col2))

r1 = pd.SparseDataFrame(cv.fit_transform(df.col2),
                        columns=cv.get_feature_names(),
                        index=df.index, default_fill_value=0)

r2 = pd.SparseDataFrame(cv.fit_transform(df2.col2),
                        columns=cv.get_feature_names(),
                        index=df2.index, default_fill_value=0)
NOTE: the pd.SparseDataFrame(sparse_array)
constructor is new in Pandas 0.20.0, so we need Pandas 0.20.0+ for this solution.
Result:
In [15]: r1
Out[15]:
flowers potato tomato
0 0.0 1 0
1 0.0 0 1
2 0.0 1 0
3 0.0 0 1
In [16]: r2
Out[16]:
flowers potato tomato
0 0 1 0.0
1 1 0 0.0
2 0 1 0.0
3 1 0 0.0
Pay attention to the memory usage:
In [17]: r1.memory_usage()
Out[17]:
Index 80
flowers 0 # 0 * 8 bytes
potato 16 # 2 * 8 bytes (int64)
tomato 16 # ...
dtype: int64
In [18]: r2.memory_usage()
Out[18]:
Index 80
flowers 16
potato 16
tomato 0
dtype: int64
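For anyone reading this on a recent stack: pd.SparseDataFrame was removed in pandas 1.0 and CountVectorizer.get_feature_names() was removed in scikit-learn 1.2, so the demo above no longer runs as-is. Below is a minimal sketch of the same CountVectorizer idea using only APIs that still exist; the shared vocab variable and the toarray() densification step are my additions, not part of the answer above, and the result is a plain (dense) DataFrame rather than a sparse one:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"col2": ["potato", "tomato", "potato", "tomato"]})
df2 = pd.DataFrame({"col2": ["potato", "flowers", "potato", "flowers"]})

# Build one shared, sorted vocabulary so both frames get identical columns.
vocab = np.union1d(df.col2, df2.col2)  # ['flowers', 'potato', 'tomato']
cv = CountVectorizer(vocabulary=vocab)

# fit_transform returns a scipy CSR matrix; densify it for a plain DataFrame.
r1 = pd.DataFrame(cv.fit_transform(df.col2).toarray(),
                  columns=vocab, index=df.index)
r2 = pd.DataFrame(cv.fit_transform(df2.col2).toarray(),
                  columns=vocab, index=df2.index)

print(r1)
print(r2)
```

Because the vocabulary is fixed up front, both frames always have the same three columns even though neither contains all three values on its own.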