I have the following Pipeline:
transformer = Pipeline([
    ('features', FeatureUnion(transformer_list=[
        ('numericals', Pipeline([
            ('selector', TypeSelector(np.number)),
            ('scaler', StandardScaler()),
        ])),
        ('categoricals', Pipeline([
            ('selector', TypeSelector('category')),
            ('labeler', StringIndexer()),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ]))
    ])),
    ('feature_selection', SelectFromModel(LinearSVC())),
    ('classifier', SVC(decision_function_shape='ovo'))
])
Here are the implementations of TypeSelector and StringIndexer:
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])


class StringIndexer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.apply(lambda s: s.cat.codes.replace(
            {-1: len(s.cat.categories)}
        ))
Training and prediction now work fine, but I want the feature names. I have read a lot on GitHub, but nothing helped. I tried something like this:
transformer.named_steps['features'].get_feature_names()
But I still get this error: AttributeError: Transformer numericals (type Pipeline) does not provide get_feature_names.
How can I implement this for the custom type selector?
Answer 0 (score: 2)
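This is complicated because sklearn does not provide get_feature_names for Pipeline. The scaler and your custom transformers do not provide feature names either. There are several tickets about this (see e.g. https://github.com/scikit-learn/scikit-learn/issues/6424 and https://github.com/scikit-learn/scikit-learn/issues/6425).
There are two possible workarounds:
1) Build the feature names manually. You need to take into account the SelectFromModel indices as well as the feature names coming out of the preceding Pipeline.
2) Use a library. We created https://github.com/TeamHG-Memex/eli5 for this purpose; it supports getting feature names from Pipeline, FeatureUnion and many built-in transformers, and you can extend it to support custom transformers or sklearn transformers it does not cover yet. See eli5.transform_feature_names and https://eli5.readthedocs.io/en/latest/libraries/sklearn.html#transformation-pipelines.
It won't work out of the box: you have to at least register transformation functions for your custom transformers, or implement a .get_feature_names method on them. Usage should look something like this (sorry, it is more like pseudocode, I haven't actually checked it):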
from eli5 import transform_feature_names

@transform_feature_names.register(StringIndexer)
def indexer_feature_names(transformer, in_names=None):
    assert in_names is not None  # don't handle it for now
    return ["StringIndexer(%s)" % name for name in in_names]
    # or just pass input feature names as-is
    # return in_names

# .. something similar for TypeSelector, and for OneHotEncoder as well

feature_names = transform_feature_names(transformer, list(df.columns))
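For workaround 1, here is a rough sketch of the bookkeeping involved, written in plain Python so it does not depend on any particular sklearn version. The helpers onehot_names and selected_names, the column names, and the mask values are all hypothetical and only illustrate the idea. With the real fitted pipeline the mask would presumably come from transformer.named_steps['feature_selection'].get_support(). The category lists would have to be read from the DataFrame (for example s.cat.categories), plus one extra entry wherever StringIndexer remaps missing values to an additional code.

```python
from typing import Dict, List, Sequence


def onehot_names(categories: Dict[str, Sequence[str]]) -> List[str]:
    """Expand each categorical column into one name per category.

    Assumes the encoder emits columns grouped by input column, with the
    categories of each column in the given order.
    """
    return [f"{col}={cat}" for col, cats in categories.items() for cat in cats]


def selected_names(names: Sequence[str], support: Sequence[bool]) -> List[str]:
    """Keep only the names whose position is True in the selector's mask."""
    if len(names) != len(support):
        raise ValueError("names and support must have the same length")
    return [name for name, keep in zip(names, support) if keep]


# Hypothetical stand-ins for what the fitted pipeline would provide.
numeric_cols = ["age", "income"]
categories = {"color": ["red", "green", "unknown"]}

# FeatureUnion concatenates its branches in order:
# 'numericals' first, then 'categoricals'.
all_names = numeric_cols + onehot_names(categories)

# Hypothetical boolean mask; with the real pipeline this would be the
# result of get_support() on the fitted 'feature_selection' step.
support = [True, False, True, False, True]

final_names = selected_names(all_names, support)
print(final_names)
```

The length check is there because an off-by-one between the names and the mask (for instance a forgotten extra category for missing values) would otherwise silently mislabel every feature after it.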