Sklearn Pipeline: Get feature names after OneHotEncode in ColumnTransformer

Time: 2019-02-12 09:27:07

Tags: python scikit-learn

I want to get the feature names after fitting the pipeline.

categorical_features = ['brand', 'category_name', 'sub_category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

numeric_features = ['num1', 'num2', 'num3', 'num4']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

Then

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', GradientBoostingRegressor())])

After fitting on a pandas DataFrame, I can get the feature importances from

clf.steps[1][1].feature_importances_

I tried clf.steps[0][1].get_feature_names(), but I got an error

AttributeError: Transformer num (type Pipeline) does not provide get_feature_names.

How can I get the feature names from this?

3 Answers:

Answer 0: (score: 3)

You can access the feature names using the following snippet:

clf.named_steps['preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names()

Reproducible example:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

df = pd.DataFrame({'brand'      : ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
                   'category'   : ['asdf','asfa','asdfas','as'], 
                   'num1'       : [1, 1, 0, 0] ,
                   'label'      : [0,0,0,1]})



numeric_features = ['num1']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['brand', 'category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', GradientBoostingRegressor())])
clf.fit(df,df['label'])

clf.named_steps['preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names()

# Output:
array(['x0_NaN', 'x0_aaaa', 'x0_asdfasdf', 'x0_sadfds', 'x1_as',
   'x1_asdf', 'x1_asdfas', 'x1_asfa'], dtype=object)

Answer 1: (score: 2)

Edit: actually, the answer to Peter's comment is in the ColumnTransformer doc:

  

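The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.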
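A small sketch of that ordering rule, using an identity FunctionTransformer so the column values stay recognizable. The array values and the transformer names 'second' and 'first' are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Three input columns with easily recognizable values.
X = np.array([[0.0, 10.0, 20.0],
              [1.0, 11.0, 21.0]])

identity = FunctionTransformer()  # with no func given, returns its input unchanged

# Column 2 is listed first, then column 0; column 1 is not listed at all,
# so it is only kept because of remainder='passthrough'.
ct = ColumnTransformer(
    transformers=[('second', identity, [2]),
                  ('first', identity, [0])],
    remainder='passthrough')

out = ct.fit_transform(X)
print(out)
# [[20.  0. 10.]
#  [21.  1. 11.]]
```

The output follows the order of the transformers list (column 2, then column 0), and the passthrough column 1 is appended on the right.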


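To complete Venkatachalam's answer with what Paul asked for in the comments: the order of the feature names returned by the ColumnTransformer's .get_feature_names() method depends on the order in which the transformers are declared in the steps variable when the ColumnTransformer is instantiated.

I could not find any documentation, so I just played with the toy example below, which let me understand the logic.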

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import RobustScaler

class testEstimator(BaseEstimator,TransformerMixin):
    def __init__(self,string):
        self.string = string

    def fit(self, X, y=None):
        # Nothing to learn; y is accepted for Pipeline compatibility
        return self

    def transform(self,X):
        # Replace the input column with a single column filled with self.string
        return np.full(X.shape, self.string).reshape(-1,1)

    def get_feature_names(self):
        # Return a list so that each name is a separate entry
        return [self.string]

transformers = [('first_transformer',testEstimator('A'),1), ('second_transformer',testEstimator('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('scaler',RobustScaler()), ('transformer', column_transformer)]
pipeline = Pipeline(steps)

dt_test = np.zeros((1000,2))
pipeline.fit_transform(dt_test)

for name,step in pipeline.named_steps.items():
    if hasattr(step, 'get_feature_names'):
        print(step.get_feature_names())

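To have a more representative example, I added a RobustScaler and nested the ColumnTransformer in a Pipeline. By the way, this is my version of Venkatachalam's approach, which gets the feature names by looping over the steps. You can turn it into a slightly more usable variable by unpacking the names with a list comprehension:

[i for k, v in pipeline.named_steps.items() if hasattr(v, 'get_feature_names') for i in v.get_feature_names()]

So play around with dt_test and the estimators to see how the feature names are built and how they are concatenated in get_feature_names(). Here is another example, with a transformer that outputs 2 columns from the input column: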

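class testEstimator3(BaseEstimator,TransformerMixin):
    def __init__(self,string):
        self.string = string

    def fit(self, X, y=None):
        # Remember the first unique value of the input column to use as a name
        self.unique = np.unique(X)[0]
        return self

    def transform(self,X):
        # Output 2 columns: the input column and a column filled with self.string
        return np.concatenate((X.reshape(-1,1), np.full(X.shape,self.string).reshape(-1,1)), axis = 1)

    def get_feature_names(self):
        # One name per output column
        return list((self.unique,self.string))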

dt_test2 = np.concatenate((np.full((1000,1),'A'),np.full((1000,1),'B')), axis = 1)

transformers = [('first_transformer',testEstimator3('A'),1), ('second_transformer',testEstimator3('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('transformer', column_transformer)]
pipeline = Pipeline(steps)

pipeline.fit_transform(dt_test2)
for step in pipeline.steps:
    if hasattr(step[1], 'get_feature_names'):
        print(step[1].get_feature_names())

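Answer 2: (score: 0)

If you are looking for how to access the column names after successive pipelines (the last one being a ColumnTransformer), you can access them by following this example:

The full_pipeline contains two pipelines, gender and relevent_experience: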

full_pipeline = ColumnTransformer([
    ("gender", gender_encoder, ["gender"]),
    ("relevent_experience", relevent_experience_encoder, ["relevent_experience"]),
])

The gender pipeline looks like this:

gender_encoder = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ("cat", OneHotEncoder())
])

After fitting the full_pipeline, you can access the column names using the following snippet:

full_pipeline.transformers_[0][1][1].get_feature_names()

In my case, the output was: array(['x0_Female', 'x0_Male', 'x0_Other'], dtype=object)