sklearn ColumnTransformer with MultiLabelBinarizer

Time: 2019-12-09 18:29:10

Tags: python python-3.x scikit-learn pipeline

I would like to know whether it is possible to use MultiLabelBinarizer inside a ColumnTransformer.

I have a toy pandas DataFrame like this:

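import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"id": [1, 2, 3],
                   "text": ["some text", "some other text", "yet another text"],
                   "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]]})

preprocess = ColumnTransformer(
    [
        ('vectorizer', CountVectorizer(), 'text'),
        ('binarizer', MultiLabelBinarizer(), ['label']),
    ],
    remainder='drop')

preprocess.fit_transform(df)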

However, this code raises an exception:

~/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    714     with _print_elapsed_time(message_clsname, message):
    715         if hasattr(transformer, 'fit_transform'):
--> 716             res = transformer.fit_transform(X, y, **fit_params)
    717         else:
    718             res = transformer.fit(X, y, **fit_params).transform(X)

TypeError: fit_transform() takes 2 positional arguments but 3 were given

With OneHotEncoder, the ColumnTransformer works fine.

2 Answers:

Answer 0: (score: 1)

For input X, MultiLabelBinarizer is suited to handling one column at a time (since each row is expected to be a sequence of categories), whereas OneHotEncoder can handle multiple columns. To make a ColumnTransformer-compatible MultiHotEncoder, you need to iterate over all columns of X and fit/transform each column with its own MultiLabelBinarizer. The following should work with pandas.DataFrame input:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder

class MultiHotEncoder(BaseEstimator, TransformerMixin):
    """Wraps `MultiLabelBinarizer` in a form that can work with `ColumnTransformer`. Note
    that input X has to be a `pandas.DataFrame`.
    """
    def __init__(self):
        self.mlbs = list()
        self.n_columns = 0
        self.categories_ = self.classes_ = list()

    def fit(self, X: pd.DataFrame, y=None):
        # Start from a clean state so that calling fit() again does not
        # accumulate binarizers from a previous fit.
        self.mlbs = list()
        self.n_columns = 0
        self.categories_ = self.classes_ = list()
        for i in range(X.shape[1]):  # X can have multiple columns
            mlb = MultiLabelBinarizer()
            mlb.fit(X.iloc[:, i])
            self.mlbs.append(mlb)
            self.classes_.append(mlb.classes_)
            self.n_columns += 1
        return self

    def transform(self, X: pd.DataFrame):
        if self.n_columns == 0:
            raise ValueError('Please fit the transformer first.')
        if self.n_columns != X.shape[1]:
            raise ValueError(f'The fit transformer deals with {self.n_columns} columns '
                             f'while the input has {X.shape[1]}.'
                            )
        result = list()
        for i in range(self.n_columns):
            # Each column is encoded by the binarizer that was fit on it
            result.append(self.mlbs[i].transform(X.iloc[:, i]))

        # Stack the per-column encodings side by side
        result = np.concatenate(result, axis=1)
        return result

# test
temp = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["some text", "some other text", "yet another text"],
    "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]],
    "label2": [["w", "c"], ["b", "c"], ["b", "d"]]
})

col_transformer = ColumnTransformer([
    ('one-hot', OneHotEncoder(), ['id', 'text']),
    ('multi-hot', MultiHotEncoder(), ['label', 'label2'])
])
col_transformer.fit_transform(temp)

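You should get:

array([[1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1.],
       [0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0.]])

Note how the first 3 and the next 3 columns are one-hot encoded, while the following 5 and the last 4 are multi-hot encoded. The category information can be found as usual, from the categories_ attribute of each fitted transformer.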
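Looking up the categories on a fitted ColumnTransformer might look like the sketch below. It is only an illustration under stated assumptions: `TinyMultiHot` is a hypothetical, condensed stand-in for the wrapper in this answer (one MultiLabelBinarizer per column, classes stored in `categories_`), and the toy frame is a trimmed copy of the test data. `named_transformers_` is the fitted ColumnTransformer attribute that maps step names to the fitted transformers.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder


class TinyMultiHot(TransformerMixin, BaseEstimator):
    """Hypothetical condensed wrapper: one MultiLabelBinarizer per column."""

    def fit(self, X, y=None):
        # Fit a separate binarizer on every column of the DataFrame.
        self.mlbs_ = [
            MultiLabelBinarizer().fit(X.iloc[:, i]) for i in range(X.shape[1])
        ]
        # Mirror OneHotEncoder: one array of categories per input column.
        self.categories_ = [mlb.classes_ for mlb in self.mlbs_]
        return self

    def transform(self, X):
        parts = [mlb.transform(X.iloc[:, i]) for i, mlb in enumerate(self.mlbs_)]
        return np.concatenate(parts, axis=1)


temp = pd.DataFrame({
    "id": [1, 2, 3],
    "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]],
    "label2": [["w", "c"], ["b", "c"], ["b", "d"]],
})

ct = ColumnTransformer([
    ("one-hot", OneHotEncoder(), ["id"]),
    ("multi-hot", TinyMultiHot(), ["label", "label2"]),
])
ct.fit(temp)

# Fitted transformers are reachable by the names given in the step list.
one_hot_categories = ct.named_transformers_["one-hot"].categories_
multi_hot_categories = ct.named_transformers_["multi-hot"].categories_
print(one_hot_categories)
print(multi_hot_categories)
```

Each entry of `multi_hot_categories` lists the classes of one input column in the order of the corresponding output columns, which helps when labelling the encoded matrix.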

Answer 1: (score: 0)

I did not go to great lengths in my testing to find out exactly why, but I was able to build a custom Transformer that essentially "wraps" MultiLabelBinarizer and is also compatible with ColumnTransformer:

class MultiLabelBinarizerFixedTransformer(BaseEstimator, TransformerMixin):
    """
    Wraps `MultiLabelBinarizer` in a form that can work with `ColumnTransformer`
    """
    def __init__(self):
        self.feature_name = ["mlb"]
        self.mlb = MultiLabelBinarizer(sparse_output=False)

    def fit(self, X, y=None):
        self.mlb.fit(X)
        return self

    def transform(self, X):
        return self.mlb.transform(X)

    def get_feature_names(self, input_features=None):
        cats = self.mlb.classes_
        if input_features is None:
            input_features = ['x%d' % i for i in range(len(cats))]
            print(input_features)
        elif len(input_features) != len(cats):
            # One input feature name is expected per class
            raise ValueError(
                "input_features should have length equal to number of "
                "features ({}), got {}".format(len(cats), len(input_features)))

        feature_names = [f"{input_features[i]}_{cats[i]}" for i in range(len(cats))]
        return np.array(feature_names, dtype=object)

My hunch is that the set of inputs MultiLabelBinarizer uses for transform() differs from what ColumnTransformer expects.
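One way to check that hunch is to inspect the method signature: MultiLabelBinarizer.fit_transform accepts only `y`, while the traceback in the question shows the transformer being called as fit_transform(X, y, **fit_params), which matches the "takes 2 positional arguments but 3 were given" message. The sketch below shows this, together with a thin wrapper used on the question's data. `LabelColumnBinarizer` is a hypothetical name chosen for illustration, and the column is selected with the scalar 'label' (an assumption that differs from the question's ['label']) so that the wrapper receives a 1-D Series of label lists rather than a one-column DataFrame.

```python
import inspect

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Parameter names of MultiLabelBinarizer.fit_transform: there is no X,
# so a call of the form fit_transform(X, y) passes one argument too many.
mlb_params = list(inspect.signature(MultiLabelBinarizer.fit_transform).parameters)
print(mlb_params)


class LabelColumnBinarizer(TransformerMixin, BaseEstimator):
    """Hypothetical thin wrapper exposing fit(X, y=None) / transform(X)."""

    def fit(self, X, y=None):
        # X is expected to be a 1-D sequence whose items are lists of labels.
        self.mlb_ = MultiLabelBinarizer().fit(X)
        return self

    def transform(self, X):
        return self.mlb_.transform(X)


df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["some text", "some other text", "yet another text"],
    "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]],
})

preprocess = ColumnTransformer(
    [
        ("vectorizer", CountVectorizer(), "text"),
        # Scalar column name: the transformer gets a Series, not a DataFrame.
        ("binarizer", LabelColumnBinarizer(), "label"),
    ],
    remainder="drop",
)
features = preprocess.fit_transform(df)
print(features.shape)
```

Passing ['label'] instead would hand the wrapper a one-column DataFrame, and iterating over a DataFrame yields column names rather than rows, so a single-binarizer wrapper like this one is best paired with a scalar column selector.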