Question

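I am trying to combine all of my pipelines with a FeatureUnion, and everything works except for one problem.

My DataFrame has an ID column that I want to keep unchanged through all of the pipelines. I have to pass it into the pipeline because I apply one-hot encoding and other steps, so I cannot simply merge it back at the end.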

scaler_pipeline = Pipeline([
    ('selector', DataFrameSelector(col_scalar)),
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler())
])
one_hot_pipeline = Pipeline([
    ('selector', DataFrameSelector(col_one_hot)),
    ('imputer', SimpleImputer(strategy="most_frequent")),
    ('one_hot', OneHotEncoder())
])

full_pipeline = FeatureUnion(transformer_list=[
    ("DataFrameSelector", DataFrameSelector(immutable_col)),
    ("scaler_pipeline", scaler_pipeline),
    ("one_hot_pipeline", one_hot_pipeline),
])

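This is my DataFrameSelector:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        # Nothing to learn; the selector is stateless.
        return self

    def transform(self, X):
        # Return only the requested columns of the DataFrame.
        return X[self.attribute_names]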

At the start of "full_pipeline", I want to select some columns (here, ID) and keep them untouched.

I now get this error:

TypeError: no supported conversion for types: (dtype('O'), dtype('float64'), dtype('float64'))

Answer 1

You can use ColumnTransformer with remainder='passthrough' to fit the transformers on the selected columns and leave the other columns unchanged.

scaler_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler())
])

one_hot_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="most_frequent")),
    ('one_hot', OneHotEncoder())
])

selector = ColumnTransformer([
    ('scalar', scaler_pipeline, col_scalar),
    ('one_hot', one_hot_pipeline, col_one_hot)
], remainder='passthrough')

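Note that with remainder='passthrough', every column that is not assigned to a transformer is passed through. If you have any columns other than ID (immutable_col) that are not in col_scalar or col_one_hot, you need to drop them first.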
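To make the note above concrete, here is a minimal runnable sketch of the remainder='passthrough' approach. The toy DataFrame, its column names (age, color, notes) and the variable wanted are assumptions made for illustration and are not from the question. Setting sparse_threshold=0 is also an addition of mine: it asks ColumnTransformer to always return a dense array, which avoids trying to stack a string ID column into a sparse matrix.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "ID": ["a1", "b2", "c3", "d4"],
    "age": [20.0, np.nan, 40.0, 30.0],
    "color": ["red", "blue", "red", "blue"],
    "notes": ["x", "y", "z", "w"],  # extra column that should not pass through
})
immutable_col = ["ID"]
col_scalar = ["age"]
col_one_hot = ["color"]

scaler_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
one_hot_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder()),
])

selector = ColumnTransformer(
    [
        ("scalar", scaler_pipeline, col_scalar),
        ("one_hot", one_hot_pipeline, col_one_hot),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # assumption: force dense output so the string ID can be stacked
)

# Drop everything that is neither transformed nor meant to pass through.
wanted = immutable_col + col_scalar + col_one_hot
result = selector.fit_transform(df[wanted])

# Transformed columns come first; passed-through columns are appended last.
print(result[:, -1])
```

The result is a NumPy array of object dtype, because the untouched ID strings share the array with the numeric features. If you need numeric dtypes downstream, you may want to split the ID column off again after the transform.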

Alternatively, you can create a passthrough transformer for the ID column and drop the other columns:

selector = ColumnTransformer([
    ('scalar', scaler_pipeline, col_scalar),
    ('one_hot', one_hot_pipeline, col_one_hot),
    ('passthrough', FunctionTransformer(lambda x: x, lambda x: x), ['ID'])
], remainder='drop')
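As a possible variation on the snippet above, ColumnTransformer also accepts the string 'passthrough' in place of a transformer, so the identity lambdas are not strictly needed. The sketch below assumes a hypothetical toy DataFrame (the columns age, color and notes are invented for illustration) and again sets sparse_threshold=0, which is my assumption for keeping the output dense next to a string ID.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "ID": ["a1", "b2", "c3"],
    "age": [20.0, 30.0, 40.0],
    "color": ["red", "blue", "red"],
    "notes": ["x", "y", "z"],  # not listed below, so remainder='drop' removes it
})

selector = ColumnTransformer(
    [
        ("scalar", StandardScaler(), ["age"]),
        ("one_hot", OneHotEncoder(), ["color"]),
        ("id", "passthrough", ["ID"]),  # keep ID exactly as it is
    ],
    remainder="drop",
    sparse_threshold=0,  # assumption: always return a dense array
)

result = selector.fit_transform(df)

# Output columns follow the order of the transformer list. The names below are
# written by hand for this toy example (one-hot categories are sorted).
out = pd.DataFrame(result, columns=["age_scaled", "color_blue", "color_red", "ID"])
print(out)
```

With remainder='drop' there is no need to filter the DataFrame beforehand, since only the listed columns reach the output. One further difference to keep in mind: a transformer built from lambdas cannot be serialized with the standard pickle module, whereas the 'passthrough' string does not have that limitation.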

Sklearn pipeline: keep the ID column unchanged