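I already have some boolean features (1 or 0), but I also have some categorical variables that need OHE and some numeric variables that need imputing/scaling... I can add the categorical and numeric variables to a pipeline ColumnTransformer, but how do I add the boolean features to the pipeline so that they are included in the model? I can't find any examples or a good phrase to search for this dilemma... any ideas?
Here is the sklearn example that combines num and cat pipelines, but what if some of my features are already in boolean form (1/0) and don't need preprocessing/OHE... how do I keep those features (i.e., add them to the pipeline alongside the num and cat variables)?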
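import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic_url = ('https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# Numeric columns: impute missing values with the median, then standardize
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Categorical columns: fill missing values with a constant, then one-hot encode
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Chain preprocessing and the classifier into a single estimator
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))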
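Answer 0 (score: 0)
I found the answer to my own question... With ColumnTransformer I can simply add more features to the lists (e.g., numeric_features and categorical_features in the code in my question), and with FeatureUnion this DataFrame selector class can add features to the pipeline... details can be found in this notebook => https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb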
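from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create a class to select numerical or categorical columns
class PandasDataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn; column selection is stateless

    def transform(self, X):
        # Return the selected columns as a NumPy array
        return X[self.attribute_names].values

# ex: one (name, pipeline) entry for a FeatureUnion
feature_list = [...]
("num_features", Pipeline([
    ("select_num_features", PandasDataFrameSelector(feature_list)),
    ("scales", StandardScaler()),
]))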
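To make the FeatureUnion idea concrete for the boolean case, here is a minimal sketch of how the selector might be used on its own, with no further steps, so that the 0/1 columns are copied through unchanged next to a scaled numeric branch. The toy DataFrame and the column names (is_alone, has_cabin) are made up for illustration and are not from the question's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class PandasDataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the named DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Hypothetical toy data: two numeric columns and two 0/1 flags.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "is_alone": [0, 0, 1, 0],
    "has_cabin": [0, 1, 0, 1],
})
num_cols = ["age", "fare"]
bool_cols = ["is_alone", "has_cabin"]

union = FeatureUnion([
    # Numeric branch: select the columns, then scale them.
    ("num_features", Pipeline([
        ("select_num_features", PandasDataFrameSelector(num_cols)),
        ("scales", StandardScaler()),
    ])),
    # Boolean branch: the selector alone, so the flags are left untouched.
    ("bool_features", PandasDataFrameSelector(bool_cols)),
])

features = union.fit_transform(df)
print(features.shape)
```

The branches are stacked side by side in the order listed, so the last two columns of the result are the original flags.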
Answer 1 (score: 0)
Use remainder='passthrough' to pass through any columns that are not handled in the column_transformers list. That said, @thePurplePython's answer is very useful.
preprocessor_pipeline = sklearn.compose.ColumnTransformer(column_transformers, remainder='passthrough')
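A small self-contained sketch of what that looks like in practice, assuming a made-up three-column DataFrame (the data and the is_alone column are hypothetical). sparse_threshold=0 is added here only so the result is always a dense array that is easy to inspect; it is not required for the passthrough itself.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric, one categorical, one 0/1 flag.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "embarked": ["S", "C", "S", "Q"],
    "is_alone": [0, 0, 1, 0],
})

# Only the columns that need preprocessing are listed here.
column_transformers = [
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["embarked"]),
]

# remainder="passthrough" appends every unlisted column (is_alone) unchanged.
preprocessor_pipeline = ColumnTransformer(
    column_transformers,
    remainder="passthrough",
    sparse_threshold=0,  # assumption for this sketch: force dense output
)

out = preprocessor_pipeline.fit_transform(df)
print(out.shape)
print(out[:, -1])  # the passed-through flag is the last column
```

Passed-through columns are placed after the outputs of the listed transformers, which is why the flag ends up in the final column.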
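Alternatively, pass a transformer triple that uses the string 'passthrough' in place of the transformer, together with the list of columns to pass through.
('passthrough', 'passthrough', passthrough_columns)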
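The difference from remainder='passthrough' is that only the columns you name are kept; with the default remainder='drop', anything not listed is discarded. A minimal sketch under the same kind of assumptions (toy data, hypothetical column names, including a ticket_id column that is deliberately left out):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "is_alone": [0, 0, 1, 0],
    "has_cabin": [0, 1, 0, 1],
    "ticket_id": [101, 102, 103, 104],  # not listed below, so it is dropped
})
passthrough_columns = ["is_alone", "has_cabin"]

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age"]),
    # The string "passthrough" stands in for a transformer object.
    ("passthrough", "passthrough", passthrough_columns),
])

out = preprocessor.fit_transform(df)
print(out.shape)
```

This form is handy when the DataFrame also carries identifier or leakage columns that should not reach the model.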