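OneHotEncoder raises a NaN error even though SimpleImputer has already been called

Time: 2019-10-14 07:55:42

Tags: python scikit-learn pipeline

I'm having trouble understanding how pipelines are supposed to be used in sklearn. Here is an example using the Titanic dataset.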

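import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv('datasets/train.csv')
cat_attribs = ["Embarked", "Cabin", "Ticket", "Name"]

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
])

str_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="most_frequent")),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["Pclass", "Age", "SibSp", "Parch", "Fare"]),
    ("str", str_pipeline, ["Cabin", "Sex"]),
    ("cat", OneHotEncoder(), ["Cabin"]),
])

full_pipeline.fit_transform(data)

I expected it to fill in all the missing NaN values (in both the numeric and the string attributes) and then finally convert the Cabin attribute into a numeric one.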

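Instead, the code ends up with the following error:

ValueError: Input contains NaN

If I remove the call to OneHotEncoder and print the transformed array, there are no NaN values.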

So I'm wondering: how should I call OneHotEncoder in this case?

1 answer:

Answer 0: (score: 1)

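I suggest applying OneHotEncoder to all of the categorical variables, so make imputation plus encoding a separate pipeline. ColumnTransformer passes the original columns to each of its transformers independently, so a standalone OneHotEncoder entry still receives the raw Cabin column with its NaN values rather than the imputed output.

Since the numeric columns need only a single processing step, SimpleImputer can be used directly in the ColumnTransformer.

Try this!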

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Categorical columns: impute first, then one-hot encode the imputed output.
cat_preprocess = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder())

# ColumnTransformer takes (name, transformer, columns) triples.
ct = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["Pclass", "Age", "SibSp", "Parch", "Fare"]),
        ("str", cat_preprocess, ["Cabin", "Sex"]),
    ])

pipeline = Pipeline([('preprocess', ct)])
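
To check the idea without the CSV file, here is a minimal sketch that runs the same kind of transformer on a tiny made-up frame. The `toy` DataFrame and its values are purely hypothetical stand-ins for `datasets/train.csv`, and `sparse_threshold=0` is an extra setting (not part of the answer above) used here only to get a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with gaps in both numeric and string columns.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "SibSp": [1, 1, 0, 1],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, np.nan],
    "Cabin": pd.Series([np.nan, "C85", "E46", "C85"], dtype=object),
    "Sex": pd.Series(["male", "female", "female", np.nan], dtype=object),
})

# Impute, then encode: the encoder only ever sees the imputed strings.
cat_preprocess = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(),
)

ct = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), ["Pclass", "Age", "SibSp", "Parch", "Fare"]),
        ("str", cat_preprocess, ["Cabin", "Sex"]),
    ],
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)

out = ct.fit_transform(toy)

print(out.shape)            # rows x (numeric columns + one-hot columns)
print(np.isnan(out).any())  # whether any NaN survived preprocessing
```

The numeric columns come first in the output, followed by the one-hot columns for Cabin and Sex, so the imputed medians can be read straight from the leading columns.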