Combining features from multiple encoders in an sklearn pipeline

Time: 2020-07-07 21:21:32

Tags: scikit-learn

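I have a dataset with categorical features and some numeric features. I am using several types of encoders to convert the categories into numeric values for further analysis. I would like to use a pipeline and combine the encoded results.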

Here is an example:

import pandas as pd
from sklearn.model_selection import train_test_split

url = 'https://raw.githubusercontent.com/stephstammel/steph_dot_ai/master/content/post/data/ks-projects-201801.csv'
df = pd.read_csv(url, nrows=100, usecols=['category', 'main_category', 'state', 'country', 'goal'])
X_train, X_test, y_train, y_test = train_test_split(df.drop('goal', axis=1), df.goal, test_size=0.5)


I have a label encoder

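from sklearn.preprocessing import LabelEncoder

# LabelEncoder accepts a single 1-D column, so it is applied column by column here
X_label_encoded = X_train.apply(lambda col: LabelEncoder().fit_transform(col))
X_label_encoded.sample(10)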


and a target encoder

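# Assumed import: category_encoders' TargetEncoder returns a DataFrame, which .sample() needs
from category_encoders import TargetEncoder

# Target encode the categorical data
te = TargetEncoder()
X_target_encoded = te.fit_transform(X_train, y_train)
X_target_encoded.sample(10)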


and I have a pipeline like this:

from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

# Regression model
model_te_loo = Pipeline([
    ('encoder', TargetEncoder()),# here I would like to add TargetEncoder and LabelEncoder 
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', BayesianRidge())
])

# Cross-validated MAE
mae_scorer = make_scorer(mean_absolute_error)

scores = cross_val_score(model_te_loo, X_train, y_train, 
                         cv=3, scoring=mae_scorer)
print('Cross-validated MAE: %0.3f +/- %0.3f'
      % (scores.mean(), scores.std()))

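This pipeline includes only one encoder. However, I would like to use all of the encoders and stack their results together, and also transform the numeric data.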

So I built a pipeline like this:

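from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder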

# Class that identifies Column type
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X[self.names]

X_num_cols = ['usd_pledged_real', 'backers']
X_cat_cols = ['category', 'main_category', 'backers', 'country', 'usd_pledged_real']
    
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols),StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols),LabelEncoder(), OneHotEncoder()))
    ])),
    ('LR_model', BayesianRidge())
])
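
One thing worth checking in the pipeline above is the `LabelEncoder` step. As far as I can tell, `LabelEncoder` is meant for the target `y` and its `fit_transform` takes a single 1-D argument, whereas a `Pipeline` calls each intermediate step with both `X` and `y`. The sketch below, on a small made-up frame (the values are hypothetical, not taken from the Kickstarter data), illustrates the difference and uses `OrdinalEncoder` as the feature-side counterpart; it is an illustration under those assumptions, not a diagnosis of the exact behaviour shown in the screenshots.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy data standing in for two categorical columns
frame = pd.DataFrame({
    "category": ["Poetry", "Music", "Film"],
    "country": ["US", "GB", "US"],
})

# A Pipeline calls intermediate steps as fit_transform(X, y);
# LabelEncoder.fit_transform only accepts one argument (y).
try:
    LabelEncoder().fit_transform(frame, None)
    label_encoder_ok = True
except TypeError:
    label_encoder_ok = False

# OrdinalEncoder works on a 2-D feature matrix and encodes each column
codes = OrdinalEncoder().fit_transform(frame)
print(label_encoder_ok)
print(codes)
```

`OrdinalEncoder` assigns integer codes per column in sorted category order, so it can play the role the label encoder was meant to play inside a feature pipeline.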

I fit the pipeline, and it looks like this:

[image: fitted pipeline]

But when I run the transform of the first step

pipe.steps[0][1].transform(X_train)

the output still has the same number of columns! This is not what I expected.


This is pipe.steps[0][1]:

[image: output of pipe.steps[0][1]]

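In summary, what I want is a pipeline that combines all of the categorical encodings with the numeric features, with no duplicated or missing columns.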
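
To make the goal concrete, here is a minimal sketch of the kind of result I am after, assuming `ColumnTransformer` from `sklearn.compose` is an acceptable way to do it. The toy frame, its values, and the names `toy`, `cat_cols`, `num_cols` and `toy_pipe` are made up for illustration. `OrdinalEncoder` and `OneHotEncoder` stand in for the encoders above; a target encoder could presumably be added as another entry in the same list.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy data standing in for the Kickstarter sample
toy = pd.DataFrame({
    "category": ["Poetry", "Music", "Film", "Music", "Poetry", "Film"],
    "country": ["US", "GB", "US", "CA", "GB", "US"],
    "backers": [0, 15, 3, 224, 16, 40],
    "goal": [1000.0, 30000.0, 45000.0, 5000.0, 19500.0, 50000.0],
})
X_toy, y_toy = toy.drop(columns="goal"), toy["goal"]

cat_cols = ["category", "country"]
num_cols = ["backers"]

# Each entry selects its own columns; the outputs are stacked column-wise
features = ColumnTransformer(
    transformers=[
        ("ordinal", OrdinalEncoder(), cat_cols),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("numeric", make_pipeline(SimpleImputer(strategy="mean"),
                                  StandardScaler()), num_cols),
    ],
    remainder="drop",      # columns not listed above are left out
    sparse_threshold=0,    # always return a dense array
)

combined = features.fit_transform(X_toy)
print(combined.shape)  # 2 ordinal + 6 one-hot + 1 numeric columns

toy_pipe = Pipeline([("features", features), ("regressor", BayesianRidge())])
toy_pipe.fit(X_toy, y_toy)
predictions = toy_pipe.predict(X_toy)
```

Because every transformer is given an explicit column list, each input column appears in the output only through the encoders it was assigned to, which is the "no duplicated or missing columns" behaviour I am looking for.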

0 answers:

No answers