sklearn ColumnTransformer: including existing features that need no transformation/preprocessing (e.g. booleans) in a pipeline

Asked: 2019-03-31 19:39:26

Tags: python scikit-learn

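I already have some boolean features (1 or 0), but I also have some categorical variables that need one-hot encoding (OHE) and some numeric variables that need imputing/scaling. I can add the categorical and numeric variables to a pipeline through a ColumnTransformer, but how do I add the boolean features to the pipeline so that they are included in the model? I can't find any examples, or even a good phrase to search for this problem. Any ideas?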

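Here is the sklearn example that combines the num and cat pipelines. But what if some of my features are already in boolean form (1/0) and need no preprocessing/OHE? How do I keep those features, i.e. add them to the pipeline alongside the num and cat variables?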

Source: Locator Strategies

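import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic_url = ('https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')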

data = pd.read_csv(titanic_url)

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

2 answers:

Answer 0 (score: 0)

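I found the answer to my own question. With ColumnTransformer, I can simply add more features to the lists (e.g. numeric_features and categorical_features in my question's code), and with FeatureUnion, the DataFrame selector class below can add features to the pipeline. Details can be found in this notebook => https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb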

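from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create a class to select numerical or categorical columns
class PandasDataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        # Nothing to learn: the selector is stateless
        return self
    def transform(self, X):
        # Return the selected DataFrame columns as a NumPy array
        return X[self.attribute_names].values

# Example: one (name, pipeline) entry for a FeatureUnion
feature_list = [...]
num_features_step = ("num_features", Pipeline([
    ("select_num_features", PandasDataFrameSelector(feature_list)),
    ("scales", StandardScaler()),
]))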
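
As a minimal sketch of how this selector approach could carry 0/1 columns through unchanged: a FeatureUnion entry that contains only the selector, with no scaler after it, hands those columns to the model as they are. The toy DataFrame and the column name is_member are made up for illustration, and the mixin is listed first here, which is the ordering scikit-learn's developer guide recommends for custom estimators.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class PandasDataFrameSelector(TransformerMixin, BaseEstimator):
    """Select DataFrame columns by name and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Hypothetical toy data: 'is_member' is already 0/1 and needs no preprocessing.
df = pd.DataFrame({
    "age": [22.0, 35.0, 58.0],
    "fare": [7.25, 71.28, 8.05],
    "is_member": [1, 0, 1],
})

union = FeatureUnion([
    # Numeric columns: select, then scale.
    ("num_features", Pipeline([
        ("select_num_features", PandasDataFrameSelector(["age", "fare"])),
        ("scales", StandardScaler()),
    ])),
    # Boolean columns: select only, so the values pass through untouched.
    ("bool_features", PandasDataFrameSelector(["is_member"])),
])

Xt = union.fit_transform(df)  # two scaled columns followed by the raw 0/1 column
```

The union object can then be used as the first step of a Pipeline, in the same position as the preprocessor in the question.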

Answer 1 (score: 0)

Use remainder='passthrough' to pass through any columns that are not handled by the column_transformers list. That said, @thePurplePython's answer is very useful.

preprocessor_pipeline = sklearn.compose.ColumnTransformer(column_transformers, remainder='passthrough')
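
A small self-contained sketch of the remainder='passthrough' option follows. The toy DataFrame and the column name is_member are hypothetical, and sparse_threshold=0 is set only so that the result is a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: 'is_member' is already 0/1 and needs no preprocessing.
df = pd.DataFrame({
    "age": [22.0, 35.0, 58.0, 41.0],
    "embarked": ["S", "C", "S", "Q"],
    "is_member": [1, 0, 0, 1],
})

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["embarked"]),
    ],
    remainder="passthrough",  # columns not listed above are appended unchanged
    sparse_threshold=0,       # always return a dense array
)

# 1 scaled column + 3 one-hot columns + 1 passed-through column
Xt = preprocessor.fit_transform(df)
```

Note that remainder='passthrough' keeps every unlisted column, so any column that should not reach the model (an ID, free text) has to be dropped beforehand.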

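Alternatively, add a transformer triple that uses the string 'passthrough' in place of a transformer object, together with the list of columns to pass through.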

('passthrough', 'passthrough', passthrough_columns)
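
Here is a sketch of that explicit triple inside a ColumnTransformer, under the assumption of a toy DataFrame whose column names (has_cabin, traveling_alone, ticket_id) are invented for illustration. Because remainder keeps its default of 'drop', only the listed columns survive.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two ready-made 0/1 features.
df = pd.DataFrame({
    "fare": [7.25, 71.28, 8.05, 53.10],
    "has_cabin": [0, 1, 0, 1],
    "traveling_alone": [1, 0, 1, 0],
    "ticket_id": ["A1", "B2", "C3", "D4"],  # listed nowhere, so it is dropped
})

passthrough_columns = ["has_cabin", "traveling_alone"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["fare"]),
        # The string 'passthrough' stands in for a transformer object.
        ("passthrough", "passthrough", passthrough_columns),
    ]
)

# One scaled column followed by the two untouched boolean columns
Xt = preprocessor.fit_transform(df)
```

Compared with remainder='passthrough', this form names the passed-through columns explicitly, which may be preferable when the DataFrame also holds columns the model should never see.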