How to implement a custom type selector in a scikit-learn Pipeline

Asked: 2018-03-20 18:37:30

Tags: python scikit-learn

I have a Pipeline:

transformer = Pipeline([
    ('features', FeatureUnion(transformer_list=[       
        ('numericals', Pipeline([
            ('selector', TypeSelector(np.number)),
            ('scaler', StandardScaler()),
        ])),       
        ('categoricals', Pipeline([
            ('selector', TypeSelector('category')),
            ('labeler', StringIndexer()),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])) 
    ])),
    ('feature_selection', SelectFromModel(LinearSVC())),
    ('classifier', SVC(decision_function_shape='ovo'))
])

Here are the implementations of TypeSelector and StringIndexer:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])

class StringIndexer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.apply(lambda s: s.cat.codes.replace(
            {-1: len(s.cat.categories)}
        ))

Training and prediction work fine now, but I want the feature names. I have read a lot on GitHub, but nothing helped. I tried something like this:

transformer.named_steps['features'].get_feature_names()

But I still get this AttributeError: Transformer numericals (type Pipeline) does not provide get_feature_names. How can I implement this for the custom type selector?

1 Answer:

Answer 0: (score: 2)

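This is complicated because sklearn does not provide get_feature_names for Pipeline. The scaler and your custom transformers do not provide feature names either. There are several tickets about this (see e.g. https://github.com/scikit-learn/scikit-learn/issues/6424 and https://github.com/scikit-learn/scikit-learn/issues/6425).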

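There are two possible workarounds:

1) Build the feature names manually. You need to take into account the indices kept by SelectFromModel as well as the feature names produced by the Pipelines that come before it.

2) Use a library. We created https://github.com/TeamHG-Memex/eli5 for this purpose; it supports getting feature names from Pipeline, FeatureUnion and many built-in transformers, and you can extend it to support your own transformers or sklearn transformers it does not cover yet. See eli5.transform_feature_names: https://eli5.readthedocs.io/en/latest/libraries/sklearn.html#transformation-pipelines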

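eli5 won't work out of the box here: you have to at least register transformation functions for the custom transformers, or implement a .get_feature_names method on them. Usage should look something like this (sorry, it is more like pseudocode, I haven't actually checked it):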

from eli5 import transform_feature_names

@transform_feature_names.register(StringIndexer)
def indexer_feature_names(transformer, in_names=None):
    assert in_names is not None  # don't handle it for now
    return ["StringIndexer(%s)" % name for name in in_names]
    # or just pass input feature names as-is
    # return in_names   

# .. something similar for TypeSelector, and for OneHotEncoder as well

feature_names = transform_feature_names(transformer, list(df.columns))
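
Workaround 1 (building the names manually) could be sketched roughly as follows. This is only an illustration in plain Python: build_feature_names, the column names, and the mask values are hypothetical, and the sketch assumes the one-hot output has exactly one column per listed category, grouped by input column. If a category never occurs in the training data, or missing values add an extra code, the counts may differ, which the length check is meant to catch.

```python
from typing import Dict, List, Sequence


def build_feature_names(
    numeric_cols: Sequence[str],
    categories_by_col: Dict[str, Sequence[str]],
    support_mask: Sequence[bool],
) -> List[str]:
    """Rebuild names in FeatureUnion order, then apply the selection mask."""
    # Numerical branch: StandardScaler keeps one output column per input column.
    names = list(numeric_cols)
    # Categorical branch (assumption): one indicator column per category,
    # grouped by input column, in category order.
    for col, categories in categories_by_col.items():
        names.extend("%s=%s" % (col, cat) for cat in categories)
    if len(names) != len(support_mask):
        raise ValueError(
            "expected %d mask entries, got %d" % (len(names), len(support_mask))
        )
    # Keep only the columns retained by the feature-selection step.
    return [name for name, keep in zip(names, support_mask) if keep]


# Hypothetical inputs. With the fitted pipeline they might come from:
#   df.select_dtypes(include=[np.number]).columns
#   {c: df[c].cat.categories for c in df.select_dtypes(include=['category'])}
#   transformer.named_steps['feature_selection'].get_support()
selected = build_feature_names(
    numeric_cols=["age", "income"],
    categories_by_col={"color": ["blue", "red"], "size": ["L", "M", "S"]},
    support_mask=[True, False, True, True, False, False, True],
)
print(selected)
```

The numerical names come first because 'numericals' is listed before 'categoricals' in the FeatureUnion; if the branches are reordered, the concatenation order in the helper has to change with them.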