I haven't been able to find the right guide over the past few days. I have searched Stack Overflow for a suitable tutorial or Q&A, mainly because the examples that show LabelBinarizer or OneHotEncoder use cases never show how they fit into a Pipeline, and vice versa.
I have a dataset with 4 variables:
num1  num2  cate1  cate2
3     4     Cat    1
9     23    Dog    0
10    5     Dog    1
num1 and num2 are numeric variables, and cate1 and cate2 are categorical. I know I need to encode the categorical variables somehow before fitting an ML algorithm, but after several attempts I'm still not sure how to do that inside a pipeline.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Transformer that selects a subset of columns by name
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)
X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# From the selected X, further choose categorical columns only
X_selected_cat = X_selected.filter(['cate1', 'cate2'])  # hand-selected, since some categorical variables take the values 0/1

# Find the numerical columns, excluding categorical columns
X_num_cols = X_selected.columns[X_selected.dtypes.apply(lambda c: np.issubdtype(c, np.number))]  # numeric column names, automated here
X_cat_cols = X_selected_cat.columns  # categorical column names, hand-selected above

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y,
                                                    test_size=0.5,
                                                    random_state=567,
                                                    stratify=y)

# Pipeline
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols), StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols)))
    ])),
    ('LR_model', LogisticRegression()),
])
This gives me the error ValueError: could not convert string to float: 'Cat'.

Replacing the 'categorical' line with

('categorical', make_pipeline(Columns(names=X_cat_cols), OneHotEncoder()))

gives me the same ValueError: could not convert string to float: 'Cat'.

Replacing it instead with

('categorical', make_pipeline(Columns(names=X_cat_cols), LabelBinarizer(), OneHotEncoder()))

gives me a different error: TypeError: fit_transform() takes 2 positional arguments but 3 were given.

And replacing the 'numeric' line with

('numeric', make_pipeline(Columns(names=X_num_cols), LabelBinarizer())),

gives me this error: TypeError: fit_transform() takes 2 positional arguments but 3 were given.
Answer 0 (score: 1)
Taking Marcus's advice, I tried but could not install the scikit-learn dev version. I did, however, find something similar called category_encoders.
Changing the code to:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import category_encoders as CateEncoder

# Transformer that selects a subset of columns by name
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)
X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# From the selected X, further choose categorical columns only
X_selected_cat = X_selected.filter(['cate1', 'cate2'])  # hand-selected, since some categorical variables take the values 0/1

# Find the numerical columns, excluding categorical columns
X_num_cols = X_selected.columns[X_selected.dtypes.apply(lambda c: np.issubdtype(c, np.number))]  # numeric column names, automated here
X_cat_cols = X_selected_cat.columns  # categorical column names, hand-selected above

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y,
                                                    test_size=0.5,
                                                    random_state=567,
                                                    stratify=y)

# Pipeline: binary-encode the categorical columns before the model
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols), StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols), CateEncoder.BinaryEncoder()))
    ])),
    ('LR_model', LogisticRegression()),
])
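With this pipeline, fitting and scoring should work end to end. A minimal usage sketch, assuming df and the train/test split above are in place:

# Fit on the training split, then score on the held-out split
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # mean accuracy of the final LogisticRegression step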
Answer 1 (score: 0)
As for me, I prefer to use LabelEncoder. Here is just a toy example.
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn import linear_model

df = pd.DataFrame({'y': [10, 2, 3, 4, 5, 6, 7, 8],
                   'a': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'b': ['a', 'b', 'a', 'b', 'a', 'b', 'b', 'b'],
                   'c': ['a', 'b', 'a', 'a', 'a', 'b', 'b', 'b']})
df
I define a class to select columns:
class MultiColumn():
    def __init__(self, columns=None):
        self.columns = columns  # array of column names to encode

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]
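Just to make the selector concrete, a quick illustration of what it returns:

# MultiColumn simply slices the requested columns out of the DataFrame
MultiColumn(columns=['a', 'c']).transform(df)   # same as df[['a', 'c']]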
Now I define the class that does the preprocessing with LabelEncoder:
lb = df[['a', 'c']]

class MyLEncoder():

    def transform(self, X, **fit_params):
        enc = preprocessing.LabelEncoder()
        enc_data = []
        # Encode each selected column separately with its own fitted LabelEncoder
        for i in list(lb.columns):
            encc = enc.fit(lb[i])
            enc_data.append(encc.transform(X[i]))
        return np.asarray(enc_data).T

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y, **fit_params):
        return self
I use a for-loop because LabelEncoder can only be applied to a single vector (one column) at a time.
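For example, on its own LabelEncoder handles exactly one column:

# LabelEncoder expects a single 1-d array/Series of labels
enc = preprocessing.LabelEncoder()
enc.fit(df['a'])        # learns the classes ['a', 'b']
enc.transform(df['a'])  # array([0, 1, 0, 1, 0, 1, 0, 1])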
The pipeline:
X = df[['a', 'b', 'c']]
y = df['y']

regressor = linear_model.SGDRegressor()

pipeline = Pipeline([
    # Use FeatureUnion to combine the features
    ('union', FeatureUnion(
        transformer_list=[
            # categorical
            ('categorical', Pipeline([
                ('selector', MultiColumn(columns=['a', 'c'])),
                ('one_hot', MyLEncoder())
            ])),
        ])),
    # Use a regression
    ('model_fitting', linear_model.SGDRegressor()),
])

pipeline.fit(X, y)
pipeline.predict(X)
And check it on new data:
new = pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'c': ['b', 'a'], 'b': [3, 6]})
pipeline.predict(new)
In the same way we can plug in any other method for preprocessing categorical data; a sketch of one such variant follows.
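For instance, a hypothetical one-hot variant of the same wrapper, sketched here with pandas.get_dummies (the class name MyDummyEncoder is made up for illustration):

class MyDummyEncoder():
    """Same interface as MyLEncoder, but one-hot encodes the selected columns."""

    def transform(self, X, **fit_params):
        # get_dummies builds one indicator column per category value
        return pd.get_dummies(X.astype(str)).values

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y, **fit_params):
        return self

Note that get_dummies recomputes the dummy columns on every call, so unseen or missing categories in new data can change the column layout; a stateful encoder fitted on the training data avoids that.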
Answer 2 (score: 0)
LabelBinarizer and LabelEncoder have fit and transform signatures that are not compatible with Pipeline (they expect only the labels, not X and y). So create your own custom transformer with the required signature:
from sklearn.preprocessing import LabelBinarizer

class LabelBinarizerPipelineFriendly(LabelBinarizer):
    def fit(self, X, y=None):
        """This allows us to fit the model based on the X input."""
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        return self  # return self so the transformer can be chained in a Pipeline

    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)

    def fit_transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)
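As a quick sanity check with some made-up labels, the subclass can now be fit and chained like any other transformer; it can then replace the bare LabelBinarizer() in the 'categorical' branch of the question's FeatureUnion:

# Standalone check of the pipeline-friendly wrapper (labels are just an example)
lb = LabelBinarizerPipelineFriendly()
lb.fit(['Cat', 'Dog', 'Dog'])        # fit now returns self, so chaining works
print(lb.transform(['Dog', 'Cat']))  # [[1] [0]] -- one binary column for two classes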