Using a Pipeline to avoid data leakage in both X and y

Time: 2019-03-13 22:44:01

Tags: python encoding scikit-learn one-hot-encoding imputation

I followed the example at this link very closely: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html but used a different dataset (available here: https://archive.ics.uci.edu/ml/datasets/Adult).

Unlike the example dataset, mine also contains two null values in the y (a.k.a. "target") column. I would like to know how to build a pipeline, similar to the one in the sklearn documentation, that transforms columns in both X and y, not just the columns in X. Here is what I have tried, but of course it does not work, because the 'target' feature has already been dropped from X to allow the train_test_split:

import os
import pandas as pd
import numpy as np

os.listdir()

df_train = pd.read_csv('data_train.txt', header=None)
df_test = pd.read_csv('data_test.txt', header=None)

names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 
        'hours-per-week', 'native-country', 'target']

df = pd.concat([df_train, df_test], axis=0)

df.head()

df.columns = names

Data Leakage Example - Preprocessing is done on the entire dataset

df_leakage = df.copy()

#Exploring nulls: 
df_leakage[df_leakage.isnull().any(axis=1)]

There are only two nulls, so it would be safe to discard them, but for the purposes of this demo we will impute values instead.
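For reference, dropping those two rows instead would be a one-liner (just a sketch; df_no_nulls is an illustrative name and is not used below):

df_no_nulls = df_leakage.dropna()  #drops every row containing a null, i.e. the two rows with a missing target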

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#Defining numeric features and the transformations we will apply to them
numeric_features = ['capital-gain', 'capital-loss', 'hours-per-week', 'education-num']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

#Defining categorical features and the transformations we will apply to them
categorical_features = ['marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

#Defining categorical ordinal features and the transformations we will apply to them
#Note: we already have the continuous education-num feature available but we will demo how to apply an ordinal transformation:
ordinal_features = ['education']
ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder())])

#Defining the target ("label") feature and the transformation we will apply to it
#Intent: have the pipeline transform y as well, not just the columns of X
label_features = ['y']
label_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('label', LabelEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('ord', ordinal_transformer, ordinal_features),
        ('lab', label_transformer, label_features),
    ])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = df_leakage.drop('target', axis=1)
y = df_leakage['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

Error log (partial):
    658             columns = list(key)
    659 
--> 660         return [all_columns.index(col) for col in columns]
    661 
    662     elif hasattr(key, 'dtype') and np.issubdtype(key.dtype, np.bool_):

ValueError: 'y' is not in list

Obviously, I could easily apply the transformations to y separately, before doing anything else, and that would handle the string issues fine (e.g. '.<=50K' vs '<=50K'). However, if I wanted to impute y with its mean, the imputed value would depend on the particular choice of y_train, which again introduces some data leakage.
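For concreteness, a minimal sketch of that separate handling of y might look like the following (this assumes the label strings only differ by whitespace and a trailing period, e.g. '<=50K.' vs '<=50K', and that the two rows with a missing target have already been dealt with; the *_clean and *_enc names are illustrative):

#clean the label strings outside the ColumnTransformer
y_train_clean = y_train.str.strip().str.rstrip('.')
y_test_clean = y_test.str.strip().str.rstrip('.')

#fit the encoder on the training labels only and reuse it on the test labels
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train_clean)
y_test_enc = label_encoder.transform(y_test_clean)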

How would I do this efficiently using the pipeline library?

0 Answers:

There are no answers.