OneHotEncoder transform fails: `df.column.cat.categories` with unknown categories is not supported

Asked: 2019-02-26 18:12:55

Tags: dask dask-distributed dask-ml

I am trying to one-hot encode the categorical features of a dask DataFrame. Every column in my DataFrame is categorical:

df.dtypes
a                category
b                category
c                category
Length: 3, dtype: object

So I thought it would be as simple as calling a OneHotEncoder instance:

from dask_ml.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit_transform(df)

But apparently it is not, and it raises this traceback:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-6-f656d2d2eec8> in <module>
      2 
      3 enc = OneHotEncoder()
----> 4 enc.fit_transform(df_churn_train[columns_categorical])

dask_environment/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in fit_transform(self, X, y)
    514                 self._categorical_features, copy=True)
    515         else:
--> 516             return self.fit(X).transform(X)
    517 
    518     def _legacy_transform(self, X):

dask_environment/lib/python3.6/site-packages/dask_ml/preprocessing/_encoders.py in fit(self, X, y)
    126 
    127         if isinstance(X, (pd.Series, pd.DataFrame)) or dask.is_dask_collection(X):
--> 128             self._fit(X, handle_unknown=self.handle_unknown)
    129         else:
    130             super(OneHotEncoder, self).fit(X, y=y)

dask_environment/lib/python3.6/site-packages/dask_ml/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
    174                 for col in X.columns:
    175                     Xi = X[col]
--> 176                     cats = _encode(Xi, uniques=Xi.cat.categories)
    177                     self.categories_.append(cats)
    178                     self.dtypes_.append(Xi.dtype)

dask_environment/lib/python3.6/site-packages/dask/dataframe/categorical.py in categories(self)
    211                    "supported.  Please use `column.cat.as_known()` or "
    212                    "`df.categorize()` beforehand to ensure known categories")
--> 213             raise NotImplementedError(msg)
    214         return self._delegate_property(self._series._meta, 'cat', 'categories')
    215 

NotImplementedError: `df.column.cat.categories` with unknown categories is not supported.  Please use `column.cat.as_known()` or `df.categorize()` beforehand to ensure known categories

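I don't think I can call as_known() through the cat accessor, because it is only available on a Series, not on a DataFrame. So I tried calling .categorize() on my df object instead, but I always get this error: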

KilledWorker: ("('assign-astype-fillna-get-categories-chunk-getitem-pandas_read_text-read-block-from-delayed-xxxxxxxxxx', 30)", 'tcp://x.x.x.x:x')

Any ideas?

0 answers:

No answers yet