How do I use the SimpleImputer class to impute missing values in different columns with different constant values?

Time: 2019-07-16 14:26:04

Tags: python pandas scikit-learn

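I am using sklearn.impute.SimpleImputer(strategy='constant', fill_value=0) to impute every column that has missing values with a constant value (0 in this case).

However, sometimes it makes sense to impute different constant values in different columns. For example, I might want to replace all NaN values in one column with that column's maximum, replace the NaN values in some other column with its minimum, or use the median or mean of that particular column.

How can I achieve this?

Also, I am new to this field, so I am not sure whether doing this would improve my model's results. Your opinions are welcome.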

1 answer:

Answer 0 (score: 0)

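If you want to impute different features with different arbitrary values, or with the median, there are two options. You can use the Feature-engine package, whose transformers let you specify the features to impute (see the second code listing). Or, with scikit-learn alone, you need to set up several SimpleImputer steps in pipelines and then combine them with ColumnTransformer: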

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# first we need to make lists, indicating which features
# will be imputed with each method

features_numeric = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
features_categoric = ['BsmtQual', 'FireplaceQu']

# then we instantiate the imputers, within a pipeline
# we create one imputer for numerical and one imputer
# for categorical

# this imputer imputes with the mean
imputer_numeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

# this imputer imputes with an arbitrary value
imputer_categoric = Pipeline(
    steps=[('imputer',
            SimpleImputer(strategy='constant', fill_value='Missing'))])

# then we put the features list and the transformers together
# using the column transformer

preprocessor = ColumnTransformer(transformers=[('imputer_numeric',
                                                imputer_numeric,
                                                features_numeric),
                                               ('imputer_categoric',
                                                imputer_categoric,
                                                features_categoric)])

# now we fit the preprocessor
preprocessor.fit(X_train)

# and now we can impute the data
# remember it returns a numpy array

X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)
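The listing above depends on X_train and X_test being defined elsewhere. As a minimal self-contained sketch of the same ColumnTransformer idea, the example below uses a made-up toy DataFrame (the column names age, income, rooms and city are hypothetical, not from the original data) and gives each column its own imputation rule. Wrapping the result back into a DataFrame is an extra step added here for readability.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the column names are made up for illustration
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "income": [1.0, 2.0, np.nan],
    "rooms": [np.nan, 3.0, 5.0],
    "city": pd.Series(["A", np.nan, "B"], dtype=object),
})

# One (name, imputer, columns) entry per imputation rule
preprocessor = ColumnTransformer(transformers=[
    ("mean", SimpleImputer(strategy="mean"), ["age"]),
    ("median", SimpleImputer(strategy="median"), ["income"]),
    ("minus_one", SimpleImputer(strategy="constant", fill_value=-1.0), ["rooms"]),
    ("missing", SimpleImputer(strategy="constant", fill_value="Missing"), ["city"]),
])

# fit_transform returns a NumPy array whose columns follow the
# order of the transformers list, not the original column order
imputed = preprocessor.fit_transform(df)

# Put the array back into a DataFrame with the matching column names
imputed_df = pd.DataFrame(imputed, columns=["age", "income", "rooms", "city"])
print(imputed_df)
```

Because numeric and string columns are stacked into one array, every column of imputed_df has object dtype; cast the numeric columns back with astype(float) if later steps need numeric types.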

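Feature-engine returns DataFrames. More information is available at the link.

To install Feature-engine, run:

pip install feature-engine

A pipeline that uses its imputers: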

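# import path as used when this answer was written; recent Feature-engine
# releases are believed to expose the imputers under feature_engine.imputation
from feature_engine import missing_data_imputers as msi
from sklearn.pipeline import Pipeline

# use a distinct name for the instance so the Pipeline class is not shadowed
pipe = Pipeline([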
    # add a binary variable to indicate missing information for the 2 variables below
    ('continuous_var_imputer', msi.AddNaNBinaryImputer(variables = ['LotFrontage', 'GarageYrBlt'])),

    # replace NA by the median in the 3 variables below, they are numerical
    ('continuous_var_median_imputer', msi.MeanMedianImputer(imputation_method='median', variables = ['LotFrontage', 'GarageYrBlt', 'MasVnrArea'])),

    # replace NA by adding the label "Missing" in categorical variables (transformer will skip those variables where there is no NA)
    ('categorical_imputer', msi.CategoricalVariableImputer(variables = ['var1', 'var2'])),

    # median imputation for two further numerical variables,
    # handled in an additional step
    ('additional_median_imputer', msi.MeanMedianImputer(imputation_method='median', variables = ['var4', 'var5'])),
     ])

pipe.fit(X_train)
X_train_t = pipe.transform(X_train)
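Neither listing uses the column maximum or minimum mentioned in the question. One possible way to get there, assuming plain pandas is acceptable, is to compute one fill value per column on the training data and pass the resulting dict to DataFrame.fillna. The frames and the column names a, b and c below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames; the column names are made up for illustration
train = pd.DataFrame({
    "a": [1.0, np.nan, 5.0],
    "b": [2.0, 8.0, np.nan],
    "c": [np.nan, 4.0, 6.0],
})
test = pd.DataFrame({
    "a": [np.nan, 2.0],
    "b": [np.nan, 3.0],
    "c": [7.0, np.nan],
})

# One fill value per column, learned from the training data only
# (pandas skips NaN by default when computing these statistics)
fill_values = {
    "a": train["a"].max(),
    "b": train["b"].min(),
    "c": train["c"].median(),
}

# fillna with a dict applies each value to the column of the same name
train_filled = train.fillna(value=fill_values)
test_filled = test.fillna(value=fill_values)
print(test_filled)
```

Computing the statistics on the training set and reusing them on the test set mirrors what fit and transform do in the scikit-learn listing, so no information from the test data leaks into the fill values.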

Hope that helps.