Dimension mismatch error with FeatureUnion in a scikit-learn pipeline

Time: 2016-07-31 14:54:54

Tags: machine-learning scikit-learn

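This is my first post. I have been trying to combine features with FeatureUnion and Pipeline, but when I add a tf-idf + svd pipeline, testing fails with a 'dimension mismatch' error. My simple task is to build a regression model that predicts search relevance. The code and the error report are below. Is there something wrong with my code?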

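# Imports for the snippet below; on scikit-learn releases older than 0.18,
# train_test_split lives in sklearn.cross_validation instead.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.svm import SVR

# read_tsv_data and tokenize are helper functions not shown here
df = read_tsv_data(input_file)
df = tokenize(df)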

df_train, df_test = train_test_split(df, test_size = 0.2, random_state=2016)
x_train = df_train['sq'].values
y_train = df_train['relevance'].values

x_test = df_test['sq'].values
y_test = df_test['relevance'].values

# char ngrams
char_ngrams = CountVectorizer(ngram_range=(2,5), analyzer='char_wb', encoding='utf-8')

# TFIDF word ngrams
tfidf_word_ngrams = TfidfVectorizer(ngram_range=(1, 4), analyzer='word', encoding='utf-8')

# SVD
svd = TruncatedSVD(n_components=100, random_state = 2016)

# SVR
svr_lin = SVR(kernel='linear', C=0.01)

pipeline = Pipeline([
    ('feature_union',
        FeatureUnion(
            transformer_list=[
                ('char_ngrams', char_ngrams),
                ('char_ngrams_svd_pipeline', make_pipeline(char_ngrams, svd)),
                ('tfidf_word_ngrams', tfidf_word_ngrams),
                ('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, svd))
            ]
        )
    ),
    ('svr_lin', svr_lin)
])
model = pipeline.fit(x_train, y_train)
y_pred = model.predict(x_test)

When the following pipeline is added to the FeatureUnion list:

('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, svd))

the following exception is raised:

    2016-07-31 10:34:08,712 : Testing ... Test Shape: (400,) - Training Shape: (1600,)
    Traceback (most recent call last):
      File "src/model/end_to_end_pipeline.py", line 236, in <module>
        main()
      File "src/model/end_to_end_pipeline.py", line 233, in main
        process_data(input_file, output_file)
      File "src/model/end_to_end_pipeline.py", line 175, in process_data
        y_pred = model.predict(x_test)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/metaestimators.py", line 37, in <lambda>
        out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 203, in predict
        Xt = transform.transform(Xt)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 523, in transform
        for name, trans in self.transformer_list)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 800, in __call__
        while self.dispatch_one_batch(iterator):
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 658, in dispatch_one_batch
        self._dispatch(tasks)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 566, in _dispatch
        job = ImmediateComputeBatch(batch)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 180, in __init__
        self.results = batch()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
        return [func(*args, **kwargs) for func, args, kwargs in self.items]
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 399, in _transform_one
        return transformer.transform(X)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/metaestimators.py", line 37, in <lambda>
        out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 291, in transform
        Xt = transform.transform(Xt)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/decomposition/truncated_svd.py", line 201, in transform
        return safe_sparse_dot(X, self.components_.T)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 179, in safe_sparse_dot
        ret = a * b
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/sparse/base.py", line 389, in __mul__
        raise ValueError('dimension mismatch')
    ValueError: dimension mismatch

1 answer:

Answer 0 (score: 0)

What happens if you change the second use of svd to a new svd instance?

from sklearn.base import clone

transformer_list = [
    ('char_ngrams', char_ngrams),
    ('char_ngrams_svd_pipeline', make_pipeline(char_ngrams, svd)),
    ('tfidf_word_ngrams', tfidf_word_ngrams),
    # clone(svd) returns a new, unfitted TruncatedSVD with the same parameters
    ('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, clone(svd)))
]

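Your problem seems to occur because you use the same svd object twice. It is fitted first on the output of CountVectorizer and then on the output of TfidfVectorizer (or vice versa), so only the last fit survives. The two vectorizers produce different numbers of features, so when predict is called on the whole pipeline, this svd object cannot handle the CountVectorizer output because it was last fitted on the TfidfVectorizer output (or, again, vice versa).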
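To make this concrete, here is a minimal sketch that should reproduce the failure with a shared TruncatedSVD and show that a cloned one avoids it. The toy queries, the helper names build_union and transform_test, and n_components=2 are assumptions made for illustration; they are not taken from the question's data or code.

```python
from sklearn.base import clone
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline

# Hypothetical toy queries standing in for the 'sq' column.
train = [
    "red running shoes",
    "blue running jacket",
    "red wool jacket",
    "green trail shoes",
    "blue wool socks",
    "green running socks",
]
test = ["blue trail shoes", "red wool socks"]


def build_union(share_svd):
    """Build a two-branch FeatureUnion; optionally reuse one SVD object."""
    char_ngrams = CountVectorizer(ngram_range=(2, 5), analyzer="char_wb")
    tfidf_word_ngrams = TfidfVectorizer(ngram_range=(1, 4), analyzer="word")
    svd = TruncatedSVD(n_components=2, random_state=2016)
    # Either the very same instance, or an unfitted copy with equal parameters.
    second_svd = svd if share_svd else clone(svd)
    return FeatureUnion([
        ("char_ngrams_svd", make_pipeline(char_ngrams, svd)),
        ("tfidf_word_ngrams_svd", make_pipeline(tfidf_word_ngrams, second_svd)),
    ])


def transform_test(union):
    """Fit on the training queries, then transform the held-out queries."""
    union.fit(train)
    return union.transform(test)


# Shared instance: the second branch refits the SVD on word n-gram features,
# so the first branch can no longer transform char n-gram features.
try:
    transform_test(build_union(share_svd=True))
    shared_failed = False
except ValueError as err:
    shared_failed = True
    print("shared svd failed:", err)

# Separate instances: each SVD keeps the components for its own vectorizer.
features = transform_test(build_union(share_svd=False))
print(features.shape)
```

Fitting alone does not reveal the problem, because each branch is fitted and applied in turn; the mismatch only shows up on a later transform or predict call, which matches the traceback in the question. Creating two TruncatedSVD objects by hand would work just as well as clone.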