Convert categorical values to numeric/float indices?

Time: 2019-04-18 18:30:34

Tags: python pandas scikit-learn

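Is there a way in pandas or sklearn to convert categorical values to unique numeric/float indices as a step inside a pipeline? I have to stick with sklearn 0.18.1, because that is the version available on the server.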

Example:
c1 | c2
--------
cat1 | 1.0
cat2 | 2.0
cat2 | 2.0
cat3 | 3.0
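
For reference, the mapping in the table above could be produced in plain pandas roughly as follows. This is only a sketch: the column names `c1`/`c2` come from the example, while the sorted category order and the 0.0 sentinel for unseen categories are assumptions made for illustration.

```python
import pandas as pd

train = pd.DataFrame({"c1": ["cat1", "cat2", "cat2", "cat3"]})

# Learn the category -> index mapping once, from the training data only.
categories = sorted(train["c1"].unique())
mapping = {cat: float(i + 1) for i, cat in enumerate(categories)}

train["c2"] = train["c1"].map(mapping)

# At prediction time, reuse the same mapping. Categories that were not in
# the training data map to NaN; 0.0 is used here as an arbitrary sentinel.
new = pd.DataFrame({"c1": ["cat2", "cat9"]})
new["c2"] = new["c1"].map(mapping).fillna(0.0)
```

Keeping the mapping as a plain dict means it can be stored next to the model and applied to any later batch, whether or not every category shows up in it.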

I'm running into problems because LabelEncoder and OneHotEncoder can't run together in a pipeline, and string data can't be fed to OneHotEncoder directly...
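
One likely reason for the LabelEncoder problem is that it is designed for target labels: its `fit`/`fit_transform` accept a single argument `y`, whereas a Pipeline calls `fit_transform(X, y)` on each step. A possible workaround is a small custom transformer that does the indexing itself. The sketch below is not a scikit-learn class; `CategoryIndexer` is a hypothetical name, and mapping unseen values to 0.0 is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class CategoryIndexer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: map each column's categories to 1.0, 2.0, ...

    Values that were not seen during fit are mapped to 0.0.
    """

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # One mapping per column, learned from the training data only.
        self.mappings_ = [
            {cat: float(i + 1) for i, cat in enumerate(sorted(X[col].unique()))}
            for col in X.columns
        ]
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        cols = [
            X[col].map(mapping).fillna(0.0).values
            for col, mapping in zip(X.columns, self.mappings_)
        ]
        # Stack the per-column index vectors into an (n_rows, n_columns) array.
        return np.column_stack(cols)


X_train = pd.DataFrame({"c1": ["cat1", "cat2", "cat2", "cat3"]})
X_new = pd.DataFrame({"c1": ["cat3", "cat4"]})

pipe = Pipeline([("index", CategoryIndexer())])
train_idx = pipe.fit_transform(X_train)
new_idx = pipe.transform(X_new)
```

Because the transformer follows the usual `fit(X, y=None)` / `transform(X)` contract, it should slot into a Pipeline after a column selector in the same position where LabelEncoder was attempted.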

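When applying predictions, some categorical values are not always present in the new data, and the model fails because I can't use handle_unknown="ignore" in the pipeline.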
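
If the built-in encoder cannot be used on string columns here, one option is to write the "ignore unknown categories" behaviour by hand. The following is a minimal sketch under that assumption; `IgnoreUnknownOneHot` is a hypothetical name, not part of scikit-learn, and it returns a dense array rather than a sparse matrix.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class IgnoreUnknownOneHot(BaseEstimator, TransformerMixin):
    """Hypothetical helper: one-hot encode string columns.

    Categories not seen during fit produce an all-zero row for that column.
    """

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        self.categories_ = [sorted(X[col].unique()) for col in X.columns]
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        blocks = []
        for col, cats in zip(X.columns, self.categories_):
            # codes is -1 wherever the value is not one of the fitted categories.
            codes = np.asarray(pd.Categorical(X[col], categories=cats).codes)
            block = np.zeros((len(codes), len(cats)))
            known = codes >= 0
            block[np.flatnonzero(known), codes[known]] = 1.0
            blocks.append(block)
        return np.hstack(blocks)


X_train = pd.DataFrame({"c1": ["cat1", "cat2", "cat2", "cat3"]})
X_new = pd.DataFrame({"c1": ["cat2", "cat9"]})

ohe_pipeline = Pipeline([("ohe", IgnoreUnknownOneHot())])
ohe_pipeline.fit(X_train)
encoded_new = ohe_pipeline.transform(X_new)
```

The output width is fixed by the categories seen in `fit`, so the downstream model always receives the same number of columns regardless of which categories appear in the new data.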

So I have had to fall back on pandas.get_dummies() for the one-hot encoding, which I would rather not use because of the missing categories.
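
If get_dummies() has to stay, the missing-category problem can usually be worked around by remembering the training columns and reindexing new data against them. A minimal sketch, assuming a single categorical column `c1` and made-up sample values:

```python
import pandas as pd

train = pd.DataFrame({"c1": ["cat1", "cat2", "cat2", "cat3"]})
new = pd.DataFrame({"c1": ["cat2", "cat4"]})

train_ohe = pd.get_dummies(train, columns=["c1"])
# Keep the training column layout alongside the fitted model.
train_columns = train_ohe.columns

# Align new data to the training layout: categories missing from the new
# data become all-zero columns, and categories unseen in training are dropped.
new_ohe = pd.get_dummies(new, columns=["c1"]).reindex(
    columns=train_columns, fill_value=0
)
```

This keeps the feature matrix the same shape at prediction time, at the cost of having to persist `train_columns` outside the pipeline.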

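# Tried:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# cat_features and X_train are defined elsewhere
cat_pipeline = Pipeline([
    ("select", DataFrameSelector(cat_features)),
    ("index", LabelEncoder()),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

# Fails: LabelEncoder.fit_transform accepts only y, but Pipeline passes (X, y)
test = cat_pipeline.fit_transform(X_train)
test.shape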

# Currently using the version below, with the one-hot encoding done beforehand
# via pandas.get_dummies(). It fails when applying predictions to new data that
# does not contain all of the categories that were encoded.

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values


    ("cat", Pipeline([
        ("select", DataFrameSelector(cat_ohe)),
    ])),

0 answers:

No answers