Question

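I have a pandas DataFrame with 56 columns. About half of the columns are floats, the rest are strings (text data), and the last column, Col56, is the label. The dataset looks like this: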

Col1 Col2...Col26 Col27       Col 28   ..... Col55     Col 56
1    4      76    I like cats Cats are cool  Cat bags  1
.
.
.
1900 rows

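I want to use both the numeric and the text data to run a classification algorithm. A quick Google search suggests that the best way to do this is with FeatureUnion.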

Here is my code so far:

import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer

df=pd.read_csv('url')
X=df[[Col1...Col55]]
y=df[[Col56]]
from sklearn.model_selection import train_test_split
stop_list=(i, am, the...)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
pipeline = Pipeline([
    ('union',FeatureUnion([
        ('Col1', Pipeline([
            ('selector', ItemSelector(column='Col1')),
            ('caster', ArrayCaster())
            ])),
.
.
.
.
.
        ('Col27',Pipeline([
            ('selector', ItemSelector(column='Col27')),
            ('vectorizer', CountVectorizer())
            ])), 
.
.
. 
        ('Col55',Pipeline([
            ('selector', ItemSelector(column='Col55')),
            ('vectorizer', CountVectorizer())
            ]))
])),
('model',SVC())
])

Then I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-8-7a2cab7bed7d> in <module>
    167         (' Col27',Pipeline([
    168             ('selector', ItemSelector(column=' Col27')),
--> 169             ('vectorizer', CountVectorizer(stop_words=stop_list))
    170         ]))

TypeError: 'tuple' object is not callable

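I don't understand this, because exactly the same approach is used here and here, and it doesn't seem to raise any error there. What am I doing wrong, and how can I fix it?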

Answer 1

I think the problem is with CountVectorizer.

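    cv = CountVectorizer()
    word_count_vector = cv.fit_transform(data)
    # .shape is a tuple attribute, not a method, so calling it raises
    # TypeError: 'tuple' object is not callable
    word_count_vector.shape()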

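This produces the same error as yours. You can actually do the combination manually: use CountVectorizer to build a sparse matrix from the text data, then join it to the numeric data matrix or DataFrame with sparse.hstack from scipy. It horizontally stacks matrices that have the same number of rows; the number of columns can be the same or different.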
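The manual approach could look roughly like the sketch below. It is only an illustration under assumptions: the toy DataFrame, the `numeric_cols` / `text_cols` lists, and the choice of one CountVectorizer per text column are made up here to stand in for the real 56-column data.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy frame standing in for the real data (assumption).
df = pd.DataFrame({
    "Col1": [1.0, 2.0, 3.0, 4.0],
    "Col2": [4.0, 5.0, 6.0, 7.0],
    "Col27": ["I like cats", "Dogs are loud", "Cats are cool", "Dogs bark"],
    "Col28": ["Cat bags", "Dog toys", "Cat toys", "Dog beds"],
    "Col56": [1, 0, 1, 0],
})
numeric_cols = ["Col1", "Col2"]   # assumed names of the float columns
text_cols = ["Col27", "Col28"]    # assumed names of the text columns

# Fit one vectorizer per text column; each yields a sparse count matrix.
vectorizers = {col: CountVectorizer() for col in text_cols}
text_blocks = [vectorizers[col].fit_transform(df[col]) for col in text_cols]

# Convert the numeric columns to a sparse block as well.
numeric_block = sparse.csr_matrix(df[numeric_cols].to_numpy(dtype=float))

# Stack side by side: row counts must match, column counts may differ.
X = sparse.hstack([numeric_block] + text_blocks).tocsr()
y = df["Col56"].to_numpy()

# SVC accepts sparse input directly.
model = SVC().fit(X, y)
print(X.shape)
```

For new or held-out rows, the fitted vectorizers would be reused with `transform` rather than `fit_transform`, so that the columns line up with those seen during training.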
