I am trying to convert the categorical features in a dask DataFrame to one-hot encoded format. My DataFrame contains only categorical features:
df.dtypes
a category
b category
c category
Length: 3, dtype: object
So I assumed it would be as simple as calling a OneHotEncoder instance:
from dask_ml.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit_transform(df)
But that does not seem to be the case, and it raises this traceback:
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-6-f656d2d2eec8> in <module>
2
3 enc = OneHotEncoder()
----> 4 enc.fit_transform(df_churn_train[columns_categorical])
dask_environment/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in fit_transform(self, X, y)
514 self._categorical_features, copy=True)
515 else:
--> 516 return self.fit(X).transform(X)
517
518 def _legacy_transform(self, X):
dask_environment/lib/python3.6/site-packages/dask_ml/preprocessing/_encoders.py in fit(self, X, y)
126
127 if isinstance(X, (pd.Series, pd.DataFrame)) or dask.is_dask_collection(X):
--> 128 self._fit(X, handle_unknown=self.handle_unknown)
129 else:
130 super(OneHotEncoder, self).fit(X, y=y)
dask_environment/lib/python3.6/site-packages/dask_ml/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
174 for col in X.columns:
175 Xi = X[col]
--> 176 cats = _encode(Xi, uniques=Xi.cat.categories)
177 self.categories_.append(cats)
178 self.dtypes_.append(Xi.dtype)
dask_environment/lib/python3.6/site-packages/dask/dataframe/categorical.py in categories(self)
211 "supported. Please use `column.cat.as_known()` or "
212 "`df.categorize()` beforehand to ensure known categories")
--> 213 raise NotImplementedError(msg)
214 return self._delegate_property(self._series._meta, 'cat', 'categories')
215
NotImplementedError: `df.column.cat.categories` with unknown categories is not supported. Please use `column.cat.as_known()` or `df.categorize()` beforehand to ensure known categories
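I don't think I can call the as_known method through the cat accessor, because it is only available on a Series, not on a DataFrame. So I tried calling .categorize() on my df object instead, but I keep getting this error: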
KilledWorker: ("('assign-astype-fillna-get-categories-chunk-getitem-pandas_read_text-read-block-from-delayed-xxxxxxxxxx', 30)", 'tcp://x.x.x.x:x')
Any ideas?