How to standardize only numeric variables in a sklearn pipeline?

Asked: 2018-02-07 21:18:02

Tags: python scikit-learn

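I am trying to create a sklearn pipeline with two steps:

  1. Standardize the data
  2. Fit the data using KNN

However, my data contains both numeric and categorical variables, and I have already converted the categorical ones to dummy variables using pd.get_dummies. I want to standardize the numeric variables but leave the dummies as they are. So far I have been doing it like this:

    X = dataframe containing both numeric and categorical columns
    numeric = [list of numeric column names]
    categorical = [list of categorical column names]
    scaler = StandardScaler()
    X_numeric_std = pd.DataFrame(data=scaler.fit_transform(X[numeric]), columns=numeric)
    X_std = pd.merge(X_numeric_std, X[categorical], left_index=True, right_index=True)

However, if I were to create a pipeline like:

    pipe = sklearn.pipeline.make_pipeline(StandardScaler(), KNeighborsClassifier())

it would standardize all of the columns in my DataFrame. Is there a way to do this while standardizing only the numeric columns?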

4 answers:

Answer 0 (score: 10)

Suppose you have the following DataFrame:

In [163]: df
Out[163]:
     a     b    c    d
0  aaa  1.01  xxx  111
1  bbb  2.02  yyy  222
2  ccc  3.03  zzz  333

In [164]: df.dtypes
Out[164]:
a     object
b    float64
c     object
d      int64
dtype: object

You can find all the numeric columns:

In [165]: num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]

In [166]: num_cols
Out[166]: Index(['b', 'd'], dtype='object')

In [167]: df[num_cols]
Out[167]:
      b    d
0  1.01  111
1  2.02  222
2  3.03  333

and apply StandardScaler only to those numeric columns:

In [168]: scaler = StandardScaler()

In [169]: df[num_cols] = scaler.fit_transform(df[num_cols])

In [170]: df
Out[170]:
     a         b    c         d
0  aaa -1.224745  xxx -1.224745
1  bbb  0.000000  yyy  0.000000
2  ccc  1.224745  zzz  1.224745

Now you can one-hot encode the categorical (non-numeric) columns...
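
The one-hot encoding step mentioned above could be sketched with pd.get_dummies as follows. The DataFrame mirrors the example in this answer; the names cat_cols and encoded are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same toy frame as in the answer above
df = pd.DataFrame({
    "a": ["aaa", "bbb", "ccc"],
    "b": [1.01, 2.02, 3.03],
    "c": ["xxx", "yyy", "zzz"],
    "d": [111, 222, 333],
})

# Numeric columns get scaled; every other column is treated as categorical
num_cols = df.select_dtypes(include=np.number).columns
cat_cols = df.columns.difference(num_cols)  # hypothetical helper name

df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Expand only the categorical columns into dummy columns;
# the scaled numeric columns are carried over unchanged
encoded = pd.get_dummies(df, columns=list(cat_cols))
```

Note that scaling outside a pipeline like this fits the scaler on all rows, so with cross-validation the scaler would also see the held-out rows.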

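Answer 1 (score: 5)

I would use FeatureUnion. I usually do something like the following, assuming you dummy-encode the categorical variables inside the pipeline rather than beforehand with Pandas: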

from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

numeric = [list of numeric column names]
categorical = [list of categorical column names]

pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=numeric),StandardScaler())),
        # Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
        ('categorical', make_pipeline(Columns(names=categorical),OneHotEncoder(sparse=False)))
    ])),
    ('model', KNeighborsClassifier())
])

You may also want to take a look at sklearn-pandas, which is interesting as well.
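
As a different route from the FeatureUnion approach in this answer, newer scikit-learn versions (0.20 and later) ship ColumnTransformer, which selects columns by name without a custom transformer class. The following is a minimal sketch under that assumption; the toy data and the column names age, income, and city are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, for illustration only
X = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [40.0, 52.5, 81.0, 90.0, 61.5, 45.0],
    "city": ["A", "B", "A", "C", "B", "C"],
})
y = [0, 0, 1, 1, 1, 0]

numeric = ["age", "income"]
categorical = ["city"]

# Each entry is (name, transformer, columns the transformer is applied to)
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), numeric),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical),
])

pipe = Pipeline([
    ("features", preprocess),
    ("model", KNeighborsClassifier(n_neighbors=3)),
])

pipe.fit(X, y)
predictions = pipe.predict(X)
```

If the categorical columns are already dummy-encoded, as in the question, the "categorical" entry can be dropped and `remainder="passthrough"` passed to ColumnTransformer so the dummy columns flow through unscaled.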

Answer 2 (score: 1)

Since you have already converted the categorical features to dummies with pd.get_dummies, you do not need OneHotEncoder. Your pipeline should therefore be:

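from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler,FunctionTransformer
from sklearn.pipeline import Pipeline,FeatureUnion

# cat_indices / num_indices: positional indices of the dummy and numeric columns
# (not defined in this answer; X is assumed to be a NumPy array)
knn=KNeighborsClassifier()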

pipeline=Pipeline(steps= [
    ('feature_processing', FeatureUnion(transformer_list = [
            ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),

            #numeric
            ('numeric', Pipeline(steps = [
                ('select', FunctionTransformer(lambda data: data[:, num_indices])),
                ('scale', StandardScaler())
                        ]))
        ])),
    ('clf', knn)
    ]
)
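
The pipeline above relies on cat_indices and num_indices, which the answer does not define. A minimal end-to-end sketch, assuming X is a NumPy array whose first two columns are numeric and whose last two are dummies (the data and the index values are hypothetical), might look like this:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical layout: columns 0-1 are numeric, columns 2-3 are dummies
num_indices = [0, 1]
cat_indices = [2, 3]

X = np.array([
    [1.0, 200.0, 1, 0],
    [2.0, 180.0, 0, 1],
    [3.0, 240.0, 1, 0],
    [4.0, 210.0, 0, 1],
    [5.0, 190.0, 1, 0],
    [6.0, 230.0, 0, 1],
])
y = np.array([0, 0, 0, 1, 1, 1])

pipeline = Pipeline(steps=[
    ("feature_processing", FeatureUnion(transformer_list=[
        # Dummy columns are selected and passed through untouched
        ("categorical", FunctionTransformer(lambda data: data[:, cat_indices])),
        # Numeric columns are selected, then standardized
        ("numeric", Pipeline(steps=[
            ("select", FunctionTransformer(lambda data: data[:, num_indices])),
            ("scale", StandardScaler()),
        ])),
    ])),
    ("clf", KNeighborsClassifier(n_neighbors=3)),
])

pipeline.fit(X, y)
# Output column order follows the FeatureUnion: dummies first, then scaled numerics
features = pipeline.named_steps["feature_processing"].transform(X)
```

If X is a pandas DataFrame instead, the lambdas would need `data.iloc[:, cat_indices]` and `data.iloc[:, num_indices]`, since `data[:, idx]` is NumPy-style indexing.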

Answer 3 (score: 0)

Another approach is:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df = pd.DataFrame()
df['col1'] = np.random.randint(1,20,10)
df['col2'] = np.random.randn(10)
df['col3'] = list(5*'Y' + 5*'N')
# Every column whose dtype is not 'object' is treated as numeric
numeric_cols = list(df.dtypes[df.dtypes != 'object'].index)
# Scale only the numeric columns; col3 is left untouched
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])