Question

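This error is very strange, and I can't even find anything about it on Google.

I'm trying to one-hot encode a column in an existing sparse DataFrame.

combined_cats is the set of all possible categories. column_name is a generic column name.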

df[column_name] = df[column_name].astype('category', categories=combined_cats,copy=False)

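However, this fails with the error in the title (AttributeError: 'CategoricalBlock' object has no attribute 'sp_index'). I assume you can't one-hot encode a sparse matrix, but I can't seem to convert it back to a dense one with to_dense(), because it says a numpy ndarray has no such method.

I tried using as_matrix() and resetting the column: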

df[column_name] = df[column_name].as_matrix()
df[column_name] = df[column_name].astype('category', categories=combined_cats,copy=False)

That didn't work either. Am I doing something wrong? The error occurs when I try to use combined_cats.

For example:

def hot_encode_column_in_both_datasets(column_name,df,df2,sparse=True):
    col1b = set(df2[column_name].unique())
    col1a = set(df[column_name].unique())
    combined_cats = list(col1a.union(col1b))
    df[column_name] = df[column_name].astype('category', categories=combined_cats,copy=False)
    df2[column_name] = df2[column_name].astype('category', categories=combined_cats,copy=False)

    df = pd.get_dummies(df, columns=[column_name],sparse=sparse)
    df2 = pd.get_dummies(df2, columns=[column_name],sparse=sparse)
    try:
        del df[column_name]
        del df2[column_name]
    except:
        pass
    return df,df2



df = pd.DataFrame({"col1":['a','b','c','d'],"col2":["potato","tomato","potato","tomato"],"col3":[1,1,1,1]})
df2 = pd.DataFrame({"col1":['g','b','q','r'],"col2":["potato","flowers","potato","flowers"],"col3":[1,1,1,1]})

## Hot encode col1
df,df2 = hot_encode_column_in_both_datasets("col1",df,df2)

len(df.columns) #9
len(df2.columns) #9

## Hot encode col2 as well
df,df2 = hot_encode_column_in_both_datasets("col2",df,df2)

Traceback (most recent call last):

  File "<ipython-input-44-d8e27874a25b>", line 1, in <module>
    df,df2 = hot_encode_column_in_both_datasets("col2",df,df2)

  File "<ipython-input-34-5ae1e71bbbd5>", line 331, in hot_encode_column_in_both_datasets
    df[column_name] = df[column_name].astype('category', categories=combined_cats,copy=False)

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2419, in __setitem__
    self._set_item(key, value)

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2485, in _set_item
    value = self._sanitize_column(key, value)

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/frame.py", line 324, in _sanitize_column
    clean = value.reindex(self.index).as_sparse_array(

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/series.py", line 573, in reindex
    return self.copy()

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/series.py", line 555, in copy
    return self._constructor(new_data, sparse_index=self.sp_index,

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2744, in __getattr__
    return object.__getattribute__(self, name)

  File "/storage/programfiles/anaconda3/lib/python3.5/site-packages/pandas/sparse/series.py", line 242, in sp_index
    return self.block.sp_index

AttributeError: 'CategoricalBlock' object has no attribute 'sp_index'
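
For readers on newer pandas: the traceback shows the failure happening when a categorical column is assigned back into a frame that the first get_dummies(sparse=True) call had already made sparse. The following is only a minimal sketch of one way to sidestep that, not the original poster's code. It assumes a pandas version that provides pd.CategoricalDtype (later pandas releases removed the categories= keyword of astype), and the helper name one_hot_shared is hypothetical. It sets the shared categories on every column first and calls get_dummies once at the end.

```python
import pandas as pd


def one_hot_shared(df, df2, columns, sparse=True):
    """One-hot encode `columns` in both frames using the union of their categories.

    Hypothetical helper for illustration only.
    """
    df, df2 = df.copy(), df2.copy()
    for name in columns:
        # Union of the values seen in either frame, in a stable order.
        cats = sorted(set(df[name].unique()) | set(df2[name].unique()))
        dtype = pd.CategoricalDtype(categories=cats)
        df[name] = df[name].astype(dtype)
        df2[name] = df2[name].astype(dtype)
    # Encode all columns in one call, after every categorical conversion is done.
    return (
        pd.get_dummies(df, columns=columns, sparse=sparse),
        pd.get_dummies(df2, columns=columns, sparse=sparse),
    )


df = pd.DataFrame({"col1": ['a', 'b', 'c', 'd'],
                   "col2": ["potato", "tomato", "potato", "tomato"],
                   "col3": [1, 1, 1, 1]})
df2 = pd.DataFrame({"col1": ['g', 'b', 'q', 'r'],
                    "col2": ["potato", "flowers", "potato", "flowers"],
                    "col3": [1, 1, 1, 1]})

out1, out2 = one_hot_shared(df, df2, ["col1", "col2"])
```

Because both frames share the same categorical dtype, get_dummies emits a column for every category, including ones that never occur in a given frame, so out1 and out2 end up with identical column lists.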

Answer 1

As I said before, I would use the CountVectorizer approach in this case.

Demo:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=np.union1d(df.col2, df2.col2))

r1 = pd.SparseDataFrame(cv.fit_transform(df.col2), 
                        columns=cv.get_feature_names(),
                        index=df.index, default_fill_value=0)

r2 = pd.SparseDataFrame(cv.fit_transform(df2.col2), 
                        columns=cv.get_feature_names(),
                        index=df2.index, default_fill_value=0)

NOTE: the pd.SparseDataFrame(sparse_array) constructor is new in Pandas 0.20.0, so we need Pandas 0.20.0+ for this solution.

Result:

In [15]: r1
Out[15]:
   flowers  potato  tomato
0      0.0       1       0
1      0.0       0       1
2      0.0       1       0
3      0.0       0       1

In [16]: r2
Out[16]:
   flowers  potato  tomato
0        0       1     0.0
1        1       0     0.0
2        0       1     0.0
3        1       0     0.0

Pay attention to the memory usage:

In [17]: r1.memory_usage()
Out[17]:
Index      80
flowers     0   # 0 * 8 bytes
potato     16   # 2 * 8 bytes (int64)
tomato     16   # ...
dtype: int64

In [18]: r2.memory_usage()
Out[18]:
Index      80
flowers    16   
potato     16
tomato      0   
dtype: int64
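
pd.SparseDataFrame was removed in pandas 1.0, so the demo above will not run on current versions. As a rough sketch of the equivalent construction, assuming pandas 0.25+ and SciPy, pd.DataFrame.sparse.from_spmatrix can wrap a SciPy sparse matrix instead. The count matrix below is built by hand as a stand-in for what cv.transform(df.col2) would return, and the names vocab, col2 and counts are illustrative only. With a recent scikit-learn the column names would come from cv.get_feature_names_out() rather than get_feature_names().

```python
import numpy as np
import pandas as pd
from scipy import sparse

vocab = ["flowers", "potato", "tomato"]
col2 = pd.Series(["potato", "tomato", "potato", "tomato"])

# Stand-in for the vectorizer output: one row per sample,
# one column per vocabulary term, a 1 where the term occurs.
codes = pd.Categorical(col2, categories=vocab).codes
counts = sparse.csr_matrix(
    (np.ones(len(col2), dtype=np.int64), (np.arange(len(col2)), codes)),
    shape=(len(col2), len(vocab)),
)

# A regular DataFrame whose columns are backed by sparse arrays.
r1 = pd.DataFrame.sparse.from_spmatrix(counts, index=col2.index, columns=vocab)

# Convert back to ordinary dense columns when needed.
dense = r1.sparse.to_dense()
```

The result is an ordinary DataFrame with sparse-typed columns, so the usual accessors such as r1.memory_usage() and r1.sparse.density remain available.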
