How to use leave-one-out encoding in an sklearn pipeline

Time: 2018-02-21 15:14:23

Tags: python encoding scikit-learn pipeline categorical-data

I want to use sklearn pipelines to test the different encoding strategies implemented in the categorical encoding package.

I mean something like this:

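from category_encoders import LeaveOneOutEncoder
from sklearn.pipeline import FeatureUnion, Pipeline
# Imputer exists only in sklearn < 0.22; later versions use sklearn.impute.SimpleImputer
from sklearn.preprocessing import Imputer, StandardScaler

# DataFrameSelector is a custom column-selecting transformer (not shown here)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_encoder', LeaveOneOutEncoder()),
    ])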

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

But I get an error:

TypeError: fit() missing 1 required positional argument: 'y'

Can anyone suggest a solution?

1 answer:

Answer 0 (score: 0)

Like you, I am only showing part of my code. I added XGBRegressor because I assume you want to predict house prices.

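import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer

class MultiColumn(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns  # list of column names to select

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]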

NUMERIC = df[['var1', 'var2']]
CATEGORICAL = df[['var3', 'var4']]

class Imputation(BaseEstimator, TransformerMixin):

    def transform(self, X, y=None, **fit_params):
        return X.fillna(NUMERIC.median())

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self

class Cat(BaseEstimator, TransformerMixin):

    def transform(self, X, y=None, **fit_params):
        enc = DictVectorizer(sparse = False)
        encc = enc.fit(CATEGORICAL.T.to_dict().values())
        enc_data = encc.transform(X.T.to_dict().values())
        enc_data[np.isnan(enc_data)] = 1
        return enc_data

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self
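One caveat about the classes above: Imputation and Cat read the module-level NUMERIC and CATEGORICAL frames, so the statistics come from whatever df holds rather than from the data passed to fit. A minimal sketch of an alternative that learns the medians in fit is shown here. The class name MedianImputer and the toy frames are hypothetical, not part of my original code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MedianImputer(BaseEstimator, TransformerMixin):
    """Hypothetical variant of Imputation that learns medians in fit()."""

    def fit(self, X, y=None):
        # Store per-column medians computed from the training data only.
        self.medians_ = X.median()
        return self

    def transform(self, X):
        # Fill missing values with the medians remembered from fit().
        return X.fillna(self.medians_)


# Toy data (made up for illustration).
train = pd.DataFrame({"var1": [1.0, np.nan, 3.0], "var2": [10.0, 20.0, np.nan]})
test = pd.DataFrame({"var1": [np.nan], "var2": [np.nan]})

imputer = MedianImputer().fit(train)
filled = imputer.transform(test)
print(filled)
```

Written this way, the transformer can be cloned and cross-validated without depending on a global DataFrame.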

And the pipeline:

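from sklearn import preprocessing
from sklearn.pipeline import FeatureUnion, Pipeline
import xgboost as xgb

pipeline = Pipeline([
    # Use FeatureUnion to combine the features
    ('union', FeatureUnion(
        transformer_list=[
            # numeric
            ('numeric', Pipeline([
                ('selector', MultiColumn(columns=['var1', 'var2'])),
                ('imp', Imputation()),
                ('scaling', preprocessing.StandardScaler(with_mean=False)),
            ])),
            # categorical
            ('categorical', Pipeline([
                ('selector', MultiColumn(columns=['var3', 'var4'])),
                ('one_hot', Cat()),
                # Each step must be a (name, transformer) tuple; the name
                # 'cat_imputer' is a placeholder, and CategoricalImputer is
                # not defined in this snippet.
                ('cat_imputer', CategoricalImputer()),
            ])),
        ])),
    ('model_fitting', xgb.XGBRegressor()),
])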
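
As for the TypeError itself: it appears to come from the fact that a supervised encoder such as LeaveOneOutEncoder needs the target in fit, while full_pipeline.fit_transform(housing) passes none. FeatureUnion and Pipeline both forward y to their steps, so passing the labels as the second argument should be enough. Below is a minimal sketch of that behaviour. ColumnSelector and TargetMeanEncoder are hypothetical stand-ins (for DataFrameSelector and for an encoder whose fit requires y), and the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for DataFrameSelector: keeps the named columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


class TargetMeanEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical supervised encoder; like the encoder in the question, fit() requires y."""

    def fit(self, X, y):  # no default for y, on purpose
        y = pd.Series(np.asarray(y), index=X.index)
        self.global_mean_ = y.mean()
        # Mean of the target per category, for every column.
        self.means_ = {col: y.groupby(X[col]).mean() for col in X.columns}
        return self

    def transform(self, X):
        encoded = {
            col: X[col].map(self.means_[col]).fillna(self.global_mean_)
            for col in X.columns
        }
        return pd.DataFrame(encoded, index=X.index).to_numpy()


def build_union():
    num_pipeline = Pipeline([
        ("selector", ColumnSelector(["rooms"])),
        ("std_scaler", StandardScaler()),
    ])
    cat_pipeline = Pipeline([
        ("selector", ColumnSelector(["ocean_proximity"])),
        ("cat_encoder", TargetMeanEncoder()),
    ])
    return FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])


# Toy data (made up for illustration).
housing = pd.DataFrame({
    "rooms": [3.0, 4.0, 5.0, 6.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})
housing_labels = pd.Series([100.0, 200.0, 300.0, 400.0])

# Without the target, the encoder's fit() fails in the same way as in the question.
try:
    build_union().fit_transform(housing)
except TypeError as exc:
    print("without y:", exc)

# With the target, y is forwarded through FeatureUnion and Pipeline to the encoder.
housing_prepared = build_union().fit_transform(housing, housing_labels)
print(housing_prepared.shape)
```

In the question's code the equivalent change would be to pass the target as the second argument of full_pipeline.fit_transform. For a pipeline that ends in a model, as in mine, pipeline.fit(X, y) forwards y in the same way.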