Question

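For the pandas DataFrame df below, I want to one-hot encode the type column and use the dictionary word2vec to convert the word column into its vector representation. I then want to concatenate the two transformed vectors with the count column to form the final features for classification.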

>>> df
       word type  count
0     apple    A      4
1       cat    B      3
2  mountain    C      1 

>>> df.dtypes
word       object
type     category
count       int64

>>> word2vec
{'apple': [0.1, -0.2, 0.3], 'cat': [0.2, 0.2, 0.3], 'mountain': [0.4, -0.2, 0.3]}

I defined a custom transformer and use FeatureUnion to concatenate these features.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder

class w2vTransformer(TransformerMixin):

    def __init__(self,word2vec):
        self.word2vec = word2vec

    def fit(self,x, y=None):
        return self

    def wv(self, w):
        return self.word2vec[w] if w in self.word2vec else [0, 0, 0]

    def transform(self, X, y=None):
         return df['word'].apply(self.wv)

pipeline = Pipeline([
    ('features', FeatureUnion(transformer_list=[
        # Part 1: get integer column
        ('numericals', Pipeline([
            ('selector', TypeSelector(np.number)),
        ])),

        # Part 2: get category column and its onehotencoding
        ('categoricals', Pipeline([
            ('selector', TypeSelector('category')),
            ('labeler', StringIndexer()),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])), 

        # Part 3: transform word to its embedding
        ('word2vec', Pipeline([
            ('w2v', w2vTransformer(word2vec)),
        ]))
    ])),
])

When I run pipeline.fit_transform(df), I get the error: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 1, expected 3.

However, if I remove the word2vec transformer (Part 3) from the pipeline, the pipeline (Part 1 + Part 2) works fine.

>>> pipeline_no_word2vec.fit_transform(df).todense()
matrix([[4., 1., 0., 0.],
        [3., 0., 1., 0.],
        [1., 0., 0., 1.]])

It also works if I keep only the w2v transformer in the pipeline.

>>> pipeline_only_word2vec.fit_transform(df)
array([list([0.1, -0.2, 0.3]), list([0.2, 0.2, 0.3]),
       list([0.4, -0.2, 0.3])], dtype=object)

My guess is that something is wrong with my w2vTransformer class, but I don't know how to fix it. Please help.

Answer 1

This error occurs because FeatureUnion expects a 2-D array from each of its parts.

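The first two parts of the FeatureUnion, 'numericals' and 'categoricals', correctly return 2-D data of shape (n_samples, n_features).

In the example data, n_samples = 3. n_features depends on the individual part (for example, OneHotEncoder changes it in the second part, while in the first part it is 1).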

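But the third part, 'word2vec', returns a pandas.Series object with the 1-D shape (3,). FeatureUnion treats this as shape (1, 3) by default, and so it complains that it does not match the other blocks.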
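The shape mismatch can be checked directly. The following is a minimal sketch that reuses the word values and the word2vec dictionary from the question; the variable names `series` and `stacked` are only illustrative and do not appear in the original code.

```python
import numpy as np
import pandas as pd

word2vec = {'apple': [0.1, -0.2, 0.3],
            'cat': [0.2, 0.2, 0.3],
            'mountain': [0.4, -0.2, 0.3]}
df = pd.DataFrame({'word': ['apple', 'cat', 'mountain']})

# What the original transform() returns: a 1-D Series whose elements are lists.
series = df['word'].apply(lambda w: word2vec.get(w, [0, 0, 0]))
print(series.shape)   # (3,)
print(series.dtype)   # object

# What FeatureUnion needs: a 2-D array of shape (n_samples, n_features).
stacked = np.array(series.tolist())
print(stacked.shape)  # (3, 3)
```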

So you need to correct this shape.

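Even if you simply call reshape() at the end to change it to shape (3, 1), your code still won't run, because the elements of that array are the lists from the word2vec dict, which are not properly converted to a 2-D array. Instead, you get an array of lists.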
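To see why reshape() alone is not enough, here is a small sketch under the same assumptions as the question's data (the names `lists` and `reshaped` are illustrative). The array gets the right number of dimensions, but each cell still holds a Python list rather than numbers.

```python
import numpy as np
import pandas as pd

word2vec = {'apple': [0.1, -0.2, 0.3],
            'cat': [0.2, 0.2, 0.3],
            'mountain': [0.4, -0.2, 0.3]}
words = pd.Series(['apple', 'cat', 'mountain'])

# Series of lists, as produced by the original transform().
lists = words.apply(lambda w: word2vec[w])

# Reshaping only changes the outer shape; the contents stay Python lists.
reshaped = lists.to_numpy().reshape(-1, 1)
print(reshaped.shape)        # (3, 1)
print(reshaped.dtype)        # object
print(type(reshaped[0, 0]))  # <class 'list'>
```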

Change w2vTransformer as follows to correct the error:

class w2vTransformer(TransformerMixin):
    ...
    ...
    def transform(self, X, y=None):
        return np.array([np.array(vv) for vv in X['word'].apply(self.wv)])

After that, the pipeline will run.
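For reference, an end-to-end version might look like the sketch below. It is not the asker's exact pipeline: TypeSelector and StringIndexer are not defined in the question, so this sketch swaps in a hypothetical `ColumnSelector` helper that picks columns by name, and it adds a hypothetical `dim` parameter for the fallback vector length. Only the word2vec transformer follows the fix described above.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: select columns by name and return a 2-D array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns].to_numpy()


class W2vTransformer(BaseEstimator, TransformerMixin):
    """Map each word to its vector; unknown words get a zero vector."""

    def __init__(self, word2vec, dim=3):
        self.word2vec = word2vec
        self.dim = dim  # assumed embedding size, used for the fallback

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        zero = [0.0] * self.dim
        # Build a real 2-D float array of shape (n_samples, dim).
        return np.array([self.word2vec.get(w, zero) for w in X['word']],
                        dtype=float)


word2vec = {'apple': [0.1, -0.2, 0.3],
            'cat': [0.2, 0.2, 0.3],
            'mountain': [0.4, -0.2, 0.3]}
df = pd.DataFrame({'word': ['apple', 'cat', 'mountain'],
                   'type': pd.Categorical(['A', 'B', 'C']),
                   'count': [4, 3, 1]})

features = FeatureUnion(transformer_list=[
    ('numericals', ColumnSelector(['count'])),
    ('categoricals', Pipeline([
        ('selector', ColumnSelector(['type'])),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ])),
    ('word2vec', W2vTransformer(word2vec)),
])

result = features.fit_transform(df)
# The union is sparse when any part is sparse; densify for inspection.
dense = result.toarray() if sparse.issparse(result) else np.asarray(result)
print(dense.shape)  # (3, 7)
```

Each row then holds the count, the three one-hot columns, and the three embedding values, in that order.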

Custom word2vec Transformer on a pandas DataFrame and using it in FeatureUnion
