One-hot encode categorical variables and scale continuous variables at the same time

Date: 2017-05-05 06:54:07

Tags: python scikit-learn

I'm confused, because running OneHotEncoder first and then StandardScaler causes a problem: the scaler will also scale the columns previously transformed by OneHotEncoder. Is there a way to perform the encoding and the scaling at the same time and then concatenate the results together?

4 answers:

Answer 0 (score: 13)

Sure. Just scale and encode the separate columns individually, as needed:

# Import libraries and download example data
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

dataset = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
print(dataset.head(5))

# Define which columns should be encoded vs scaled
columns_to_encode = ['rank']
columns_to_scale  = ['gre', 'gpa']

# Instantiate encoder/scaler
scaler = StandardScaler()
ohe    = OneHotEncoder(sparse=False)

# Scale and Encode Separate Columns
scaled_columns  = scaler.fit_transform(dataset[columns_to_scale]) 
encoded_columns =    ohe.fit_transform(dataset[columns_to_encode])

# Concatenate (Column-Bind) Processed Columns Back Together
processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)
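One drawback of `np.concatenate` is that the column names are lost. As a rough sketch (using a tiny inline frame instead of the downloaded dataset, and an illustrative `rank_` prefix that is not part of the original answer), the fitted encoder's `categories_` attribute, available from scikit-learn 0.20 onward, can be used to rebuild a labelled DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in for the downloaded dataset
df = pd.DataFrame({'gre': [500.0, 660.0, 800.0],
                   'gpa': [3.1, 3.7, 4.0],
                   'rank': [2, 1, 1]})

scaler = StandardScaler()
ohe = OneHotEncoder()  # calling .toarray() below keeps this version-agnostic

scaled = scaler.fit_transform(df[['gre', 'gpa']])
encoded = ohe.fit_transform(df[['rank']]).toarray()

# Rebuild readable column names: scaled columns keep theirs, encoded
# columns get one name per category discovered by the encoder.
encoded_names = ['rank_%s' % c for c in ohe.categories_[0]]
processed = pd.DataFrame(np.concatenate([scaled, encoded], axis=1),
                         columns=['gre', 'gpa'] + encoded_names)
print(processed.columns.tolist())
```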

Answer 1 (score: 2)

Scikit-learn version 0.20 provides sklearn.compose.ColumnTransformer, a column transformer for data with mixed types. You can scale the numeric features and one-hot encode the categorical ones. Below is an official example (you can find the code here):

# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause

from __future__ import print_function

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

Caution: this method is experimental, and some of its behaviour may change between releases without deprecation.
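The GridSearchCV import in the example above is no accident: because the ColumnTransformer is nested inside a Pipeline, the parameters of its inner steps can be tuned with the usual double-underscore syntax. A minimal sketch on synthetic data, not part of the official example (the step names follow those defined above; the toy frame and parameter values are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the Titanic data
rng = np.random.RandomState(0)
X = pd.DataFrame({'age': rng.normal(40, 10, 60),
                  'fare': rng.normal(30, 5, 60),
                  'embarked': rng.choice(['C', 'S', 'Q'], 60)})
y = rng.randint(0, 2, 60)

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['age', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['embarked'])])

clf = Pipeline([('preprocessor', preprocessor),
                ('classifier', LogisticRegression(solver='lbfgs'))])

# Parameters of nested steps are addressed with '__' between step names.
param_grid = {'preprocessor__num__imputer__strategy': ['mean', 'median'],
              'classifier__C': [0.1, 1.0]}
grid = GridSearchCV(clf, param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```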

Answer 2 (score: 2)

There are currently several ways to achieve the result the OP asked for. Three ways to do it are:

  1. np.concatenate() - see this answer to the OP's question, already posted

  2. scikit-learn's ColumnTransformer

  3. scikit-learn's FeatureUnion

Using the example posted by @Max Power here, below is a minimal working snippet that does what the OP is looking for and gathers the transformed columns into a single Pandas dataframe. The output of all three methods is shown.

The code common to all 3 methods is:

import numpy as np
import pandas as pd

# Import libraries and download example data
from sklearn.preprocessing import StandardScaler, OneHotEncoder

dataset = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# Define which columns should be encoded vs scaled
columns_to_encode = ['rank']
columns_to_scale  = ['gre', 'gpa']

# Instantiate encoder/scaler
scaler = StandardScaler()
ohe    = OneHotEncoder(sparse=False)

Method 1. See the code here. To show the output, you can use:

print(pd.DataFrame(processed_data).head())

Output of method 1.

          0         1    2    3    4    5
0 -1.800263  0.579072  0.0  0.0  1.0  0.0
1  0.626668  0.736929  0.0  0.0  1.0  0.0
2  1.840134  1.605143  1.0  0.0  0.0  0.0
3  0.453316 -0.525927  0.0  0.0  0.0  1.0
4 -0.586797 -1.209974  0.0  0.0  0.0  1.0

Method 2.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


p = Pipeline([
    ('coltransformer', ColumnTransformer(transformers=[
        ('assessments', Pipeline([('scale', scaler)]), columns_to_scale),
        ('ranks', Pipeline([('encode', ohe)]), columns_to_encode),
    ])),
])

print(pd.DataFrame(p.fit_transform(dataset)).head())

Output of method 2.

          0         1    2    3    4    5
0 -1.800263  0.579072  0.0  0.0  1.0  0.0
1  0.626668  0.736929  0.0  0.0  1.0  0.0
2  1.840134  1.605143  1.0  0.0  0.0  0.0
3  0.453316 -0.525927  0.0  0.0  0.0  1.0
4 -0.586797 -1.209974  0.0  0.0  0.0  1.0

Method 3.

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a subset of dataframe columns by key."""
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, df):
        return df[self.key]

p = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('assessments', Pipeline([
            ('selector', ItemSelector(key=columns_to_scale)),
            ('scale', scaler),
        ])),
        ('ranks', Pipeline([
            ('selector', ItemSelector(key=columns_to_encode)),
            ('encode', ohe),
        ])),
    ])),
])

print(pd.DataFrame(p.fit_transform(dataset)).head())

Output of method 3.

          0         1    2    3    4    5
0 -1.800263  0.579072  0.0  0.0  1.0  0.0
1  0.626668  0.736929  0.0  0.0  1.0  0.0
2  1.840134  1.605143  1.0  0.0  0.0  0.0
3  0.453316 -0.525927  0.0  0.0  0.0  1.0
4 -0.586797 -1.209974  0.0  0.0  0.0  1.0

Explanation

  1. Method 1 has already been explained.

  2. Methods 2 and 3 accept the full dataset but perform specific operations only on subsets of the data. The modified/processed subsets are then brought together (merged) into the final output.
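A detail worth knowing about method 2: by default, ColumnTransformer drops every column it was not given a transformer for, which is why methods 2 and 3 can take the full dataset and still emit only the processed subsets. A rough sketch with a tiny inline frame (not part of the original answer) showing the default versus `remainder='passthrough'`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'gre': [500.0, 660.0, 800.0],
                   'gpa': [3.1, 3.7, 4.0],
                   'rank': [2, 1, 1]})

# By default, ColumnTransformer drops columns it was not told about...
ct_drop = ColumnTransformer([('scale', StandardScaler(), ['gre', 'gpa'])])
print(ct_drop.fit_transform(df).shape)   # (3, 2) -- 'rank' is gone

# ...while remainder='passthrough' appends them unchanged.
ct_keep = ColumnTransformer([('scale', StandardScaler(), ['gre', 'gpa'])],
                            remainder='passthrough')
print(ct_keep.fit_transform(df).shape)   # (3, 3) -- 'rank' kept as-is
```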

Details

pandas==0.23.4
numpy==1.15.2
scikit-learn==0.20.0

Additional notes

The 3 methods shown here are probably not the only possibilities. I am sure there are other ways to do this.

Sources used

Updated link to binary.csv dataset

Answer 3 (score: 0)

I don't see your point. OneHotEncoder is meant for nominal data and StandardScaler for numeric data, so you shouldn't apply them together to the same data.