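While looking for a way to make a step of an sklearn Pipeline run on only some columns, I came across sklearn.pipeline.FeatureUnion in this answer on Stack Overflow. However, I can't figure out how to leave the columns I don't need to transform untouched and still pass the complete data on to the next step. For example, in the first step I want to apply StandardScaler to only some columns. That can be done with the code shown below, but the problem is that the next step then receives only the standard-scaled columns. How can the next step get the complete data, including the columns scaled in the previous step?
Here is some sample code: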
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor

class Columns(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns by name."""
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

numeric_cols = [...]  # placeholder: list of numeric column names

pipe = Pipeline([
    # the step below is applied to only some columns
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=numeric_cols), StandardScaler())),
    ])),
    # FeatEng_1..FeatEng_3 and Skew_Remover are my own transformers (not shown here)
    ('feature_engineer_step1', FeatEng_1()),
    ('feature_engineer_step2', FeatEng_2()),
    ('feature_engineer_step3', FeatEng_3()),
    ('remove_skew', Skew_Remover()),
    # the step below is applied to all columns
    ('model', RandomForestRegressor())
])
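Edit:
Since the accepted answer does not include any sample code, I am pasting my code here for anyone who runs into this problem and is looking for working code. The data used in the example below is the California housing data that ships with Google Colab.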
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# house_train and house_test are pandas DataFrames holding the California housing data

# a column transformer that operates on only some columns
num_cols = ['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
p_stand_scaler_1 = ColumnTransformer(
    transformers=[('stand_scale', StandardScaler(), num_cols)],
    # remainder='passthrough' passes all unspecified columns, untouched, to the next steps
    remainder='passthrough')

# build a pipeline with all the steps
pipe_1 = Pipeline(steps=[('standard_scaler', p_stand_scaler_1),
                         ('rf_regressor', RandomForestRegressor(random_state=100))])

# fit the pipeline on the training data
pipe_1.fit(house_train.drop('median_house_value', axis=1), house_train.loc[:, 'median_house_value'])

# make predictions
pipe_predictions = pipe_1.predict(house_test.drop('median_house_value', axis=1))
Answer 0 (score: 2)
You can use sklearn's ColumnTransformer. Here is a snippet to help you.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

# transform columns
# num_cols = numerical columns, categorical_col = categorical columns
preprocessor = ColumnTransformer(transformers=[('minmax', MinMaxScaler(), num_cols),
                                               ('onehot', OneHotEncoder(), categorical_col)])

# model
model = RandomForestClassifier(random_state=0)

# model pipeline
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])
model_pipeline.fit(x_train, y_train)
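For readers who want to run the snippet above end to end, here is a minimal self-contained sketch. The toy DataFrame, its column names (height, weight, color, row_id) and the labels are invented purely for illustration and are not from the original post. It also shows one thing worth knowing: with the default remainder='drop', any column not listed in a transformer (row_id here) is not passed to the model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data: all names and values here are made up for this illustration.
x_train = pd.DataFrame({
    "height": [1.5, 1.8, 1.6, 1.9, 1.7, 1.4],
    "weight": [50.0, 80.0, 60.0, 90.0, 70.0, 45.0],
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "row_id": [101, 102, 103, 104, 105, 106],  # not listed in any transformer
})
y_train = [0, 1, 0, 1, 1, 0]

num_cols = ["height", "weight"]
categorical_col = ["color"]

preprocessor = ColumnTransformer(transformers=[
    ("minmax", MinMaxScaler(), num_cols),
    ("onehot", OneHotEncoder(), categorical_col),
])

# 2 scaled columns + 3 one-hot columns; row_id is dropped because
# remainder defaults to "drop".
features = preprocessor.fit_transform(x_train)
n_features = features.shape[1]

model_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(random_state=0)),
])
model_pipeline.fit(x_train, y_train)
predictions = model_pipeline.predict(x_train)
```

If the unlisted columns should reach the model as well, pass remainder='passthrough' to ColumnTransformer.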
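Answer 1 (score: 1)
I believe using ColumnTransformer (from sklearn.compose import ColumnTransformer) should do the trick.
When instantiating the column transformer, you can set remainder='passthrough', which leaves the remaining columns unchanged.
Then instantiate your Pipeline object with the column transformer as its first step.
That way, the next pipeline step receives all the columns, as desired.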
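The effect of remainder='passthrough' can be checked directly on a small example. The sketch below uses a made-up three-column DataFrame (the names age, income and flag are assumptions for illustration only): two columns are standard-scaled and the third is passed through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data: the column names are invented for this illustration.
df = pd.DataFrame({
    "age": [10.0, 20.0, 30.0, 40.0],
    "income": [1.0, 2.0, 3.0, 4.0],
    "flag": [0, 1, 0, 1],
})

scaler = ColumnTransformer(
    transformers=[("stand_scale", StandardScaler(), ["age", "income"])],
    remainder="passthrough",  # keep the unlisted column ("flag") as-is
)
out = scaler.fit_transform(df)

# The result is a NumPy array: the transformed columns come first,
# followed by the passed-through columns.
scaled_part = out[:, :2]
passthrough_part = out[:, 2]
```

Note that the column order of the output differs from the input when the scaled columns are not the leading ones, so any later step that relies on column positions should account for that.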