How do I combine LabelBinarizer and OneHotEncoder for categorical variables in a Python pipeline?

Asked: 2018-02-27 22:00:37

Tags: python machine-learning scikit-learn preprocessor feature-extraction

I have spent the past few days searching Stack Overflow for the right tutorial or Q&A without success, mostly because the examples that show LabelBinarizer or OneHotEncoder use cases do not show how they fit into a pipeline, and vice versa.

I have a dataset with 4 variables:

num1    num2    cate1    cate2
3       4       Cat      1
9       23      Dog      0
10      5       Dog      1

num1 and num2 are numeric variables, and cate1 and cate2 are categorical. I know I need to encode the categorical variables somehow before fitting an ML algorithm, but after several attempts I am still not sure how to do this inside a pipeline.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer

# Class that identifies Column type
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# from the selected X, further choose categorical only
X_selected_cat = X_selected.filter(['cate1', 'cate2']) # hand selected since some cat var has value 0, 1

# Find the numerical columns, exclude categorical columns
X_num_cols = X_selected.columns[X_selected.dtypes.apply(lambda c: np.issubdtype(c, np.number))] # list of numeric column names, automated here
X_cat_cols = X_selected_cat.columns # list of categorical column names, previously hand-selected

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, 
                                                    test_size=0.5, 
                                                    random_state=567, 
                                                    stratify=y)

# Pipeline
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols),StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols)))
    ])),
    ('LR_model', LogisticRegression()),
])

This gives me the error ValueError: could not convert string to float: 'Cat'

Replacing the last line with this

('categorical', make_pipeline(Columns(names=X_cat_cols),OneHotEncoder()))

gives me the same ValueError: could not convert string to float: 'Cat'

Replacing the last line with this

    ('categorical', make_pipeline(Columns(names=X_cat_cols),LabelBinarizer(),OneHotEncoder()))
])),

gives me a different error: TypeError: fit_transform() takes 2 positional arguments but 3 were given

And replacing the 'numeric' line with this

('numeric', make_pipeline(Columns(names=X_num_cols),LabelBinarizer())),

gives me this error: TypeError: fit_transform() takes 2 positional arguments but 3 were given

3 Answers:

Answer 0 (score: 1)

Following Marcus's suggestion, I tried but could not install the scikit-learn dev version; however, I found something similar called category_encoders.

Changing the code to:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer
import category_encoders as CateEncoder

# Class that identifies Column type
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# from the selected X, further choose categorical only
X_selected_cat = X_selected.filter(['cate1', 'cate2']) # hand selected since some cat var has value 0, 1

# Find the numerical columns, exclude categorical columns
X_num_cols = X_selected.columns[X_selected.dtypes.apply(lambda c: np.issubdtype(c, np.number))] # list of numeric column names, automated here
X_cat_cols = X_selected_cat.columns # list of categorical column names, previously hand-selected

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, 
                                                    test_size=0.5, 
                                                    random_state=567, 
                                                    stratify=y)

# Pipeline
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols),StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols),CateEncoder.BinaryEncoder()))
    ])),
    ('LR_model', LogisticRegression()),
])
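
Assuming the category_encoders package is installed (it is published on PyPI under that name) and df holds data shaped like the question's sample, fitting and evaluating this pipeline is then a minimal sketch along these lines, reusing the train/test split defined above:

pipe.fit(X_train, y_train)            # BinaryEncoder learns the category codes here
print(pipe.score(X_test, y_test))     # mean accuracy of the LogisticRegression step
predictions = pipe.predict(X_test)    # class predictions for unseen rows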

Answer 1 (score: 0)

As for me, I prefer to use LabelEncoder. Here is just a toy example.

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn import linear_model

df= pd.DataFrame({ 'y': [10,2,3,4,5,6,7,8], 'a': ['a', 'b','a', 'b','a', 'b','a', 'b' ],
                  'b': ['a', 'b','a', 'b','a', 'b','b', 'b' ],  'c': ['a', 'b','a', 'a','a', 'b','b', 'b' ]})
df

I define a class to select columns:

class MultiColumn():
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self
    def transform(self, X):                                                           
        return X[self.columns]

Now I define the class that does the preprocessing with LabelEncoder:

lb = df[['a', 'c']]  # reference frame holding every category level of columns 'a' and 'c'
class MyLEncoder():

    def transform(self, X, **fit_params):
        enc = preprocessing.LabelEncoder()
        enc_data = []
        for i in list(lb.columns):
            # Fit the encoder on the reference frame so every level is known,
            # then encode the corresponding column of X.
            encc = enc.fit(lb[i])
            enc_data.append(encc.transform(X[i]))

        return np.asarray(enc_data).T

    def fit_transform(self, X,y=None,  **fit_params):
        self.fit(X,y,  **fit_params)
        return self.transform(X)

    def fit(self, X, y, **fit_params):
        return self

I use a for-loop because LabelEncoder can only be applied to a single vector at a time. The pipeline:

X = df[['a', 'b', 'c']]
y = df['y']
regressor = linear_model.SGDRegressor()

pipeline = Pipeline([
    # Use FeatureUnion to combine the features
    ('union', FeatureUnion(
        transformer_list=[
             # categorical
            ('categorical', Pipeline([
                 ('selector', MultiColumn(columns=['a', 'c'])),
                ('one_hot', MyLEncoder())
            ])),
        ])),
    # Use a regression
    ('model_fitting', linear_model.SGDRegressor()),
])
pipeline.fit(X, y)
pipeline.predict(X)

And check it on new data:

new= pd.DataFrame({ 'y': [3, 8], 'a': ['a', 'b' ],'c': ['b', 'a' ], 'b': [3, 6],})
pipeline.predict(new)

Similarly, we can do the same for any method of preprocessing categorical data, as sketched below.
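
As an illustration of that last point (a sketch that is not part of the original answer), the LabelEncoder inside the custom class could be swapped for pandas.get_dummies, with the dummy-column layout frozen at fit time so that transform yields the same columns for new data:

import pandas as pd

class MyDummyEncoder():
    """Same wrapper pattern as MyLEncoder, but one-hot encodes via pd.get_dummies."""
    def fit(self, X, y=None, **fit_params):
        # Remember the dummy columns produced on the training data.
        self.columns_ = pd.get_dummies(X.astype(str)).columns
        return self

    def transform(self, X, **fit_params):
        dummies = pd.get_dummies(X.astype(str))
        # Align to the training columns: categories unseen at fit time are dropped,
        # categories missing from X come back as all-zero columns.
        return dummies.reindex(columns=self.columns_, fill_value=0).values

    def fit_transform(self, X, y=None, **fit_params):
        return self.fit(X, y).transform(X)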

Answer 2 (score: 0)

The fit and transform signatures of LabelBinarizer and LabelEncoder are not compatible with Pipeline. So create your own custom transformer with the required signature.
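
For context, here is a small sketch of the mismatch (based on the scikit-learn API at the time; this snippet is not part of the original answer):

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit_transform(['Cat', 'Dog', 'Dog'])  # fine: LabelBinarizer expects a single label vector
# A Pipeline or FeatureUnion step is called as step.fit_transform(X, y), i.e. two
# positional arguments plus self, which produces the
# "fit_transform() takes 2 positional arguments but 3 were given" error above.

The wrapper below restores a pipeline-compatible signature: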

class LabelBinarizerPipelineFriendly(LabelBinarizer):
    def fit(self, X, y=None):
        """This would allow us to fit the model based on the X input."""
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        return self

    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)

    def fit_transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)
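
As a rough usage sketch (not part of the original answer, and assuming the Columns selector, column lists, and train/test split from the question are already defined), the wrapper can be dropped into the question's pipeline in place of the plain LabelBinarizer:

pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols), StandardScaler())),
        # LabelBinarizer handles one label vector at a time, so select the single
        # string column; cate2 is already 0/1 and is picked up by the dtype-based
        # X_num_cols selection in the question's code.
        ('categorical', make_pipeline(Columns(names='cate1'),
                                      LabelBinarizerPipelineFriendly()))
    ])),
    ('LR_model', LogisticRegression()),
])
pipe.fit(X_train, y_train)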