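I have a pandas DataFrame with 56 columns. About half of the columns are floats, the rest are strings (text data), and the last column, Col56, is the label column. The dataset looks like this: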
Col1 Col2...Col26 Col27 Col 28 ..... Col55 Col 56
1 4 76 I like cats Cats are cool Cat bags 1
.
.
.
1900 rows
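I want to use both the numeric and the text data to run a classification algorithm. A quick Google search suggests that the best way to do this is with FeatureUnion.
Here is my code so far: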
import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
df=pd.read_csv('url')
X=df[[Col1...Col55]]
y=df[[Col56]]
from sklearn.model_selection import train_test_split
stop_list=(i, am, the...)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
pipeline = Pipeline([
('union',FeatureUnion([
('Col1', Pipeline([
('selector', ItemSelector(column='Col1')),
('caster', ArrayCaster())
])),
.
.
.
.
.
('Col27',Pipeline([
('selector', ItemSelector(column='Col27')),
('vectorizer', CountVectorizer())
])),
.
.
.
('Col55',Pipeline([
('selector', ItemSelector(column='Col55')),
('vectorizer', CountVectorizer())
]))
])),
('model',SVC())
])
Then I get an error:
TypeError Traceback (most recent call last)
<ipython-input-8-7a2cab7bed7d> in <module>
167 (' Col27',Pipeline([
168 ('selector', ItemSelector(column=' Col27')),
--> 169 ('vectorizer', CountVectorizer(stop_words=stop_list))
170 ]))
TypeError: 'tuple' object is not callable
Answer 0 (score: 0)
I think the problem is with CountVectorizer.
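cv = CountVectorizer()
word_count_vector = cv.fit_transform(data)
# shape is an attribute (a tuple), not a method, so calling it raises
# TypeError: 'tuple' object is not callable
word_count_vector.shape()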
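This produces the same error as yours. You can also do the combination manually: use CountVectorizer to build a sparse matrix from the text data, then join it with the numeric data matrix or DataFrame using sparse.hstack from scipy. It horizontally stacks two matrices that have the same number of rows and the same or different numbers of columns.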
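The manual approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the toy DataFrame, its values, and the choice of a single numeric column (Col1) and a single text column (Col27) are made up here to stand in for the real 56-column data, which is not shown in the question.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy frame standing in for the real data:
# one numeric column, one text column, one label column.
df = pd.DataFrame({
    "Col1": [1.0, 2.5, 0.3, 4.1],
    "Col27": ["I like cats", "Cats are cool", "Dogs bark loudly", "I like dogs"],
    "Col56": [1, 1, 0, 0],
})

# Turn the text column into a sparse document-term matrix.
cv = CountVectorizer()
text_features = cv.fit_transform(df["Col27"])  # shape: (n_rows, n_vocabulary)

# Wrap the numeric column(s) as a sparse matrix with the same number of rows.
numeric_features = sparse.csr_matrix(df[["Col1"]].to_numpy())

# Stack side by side: row counts must match, column counts may differ.
X = sparse.hstack([numeric_features, text_features]).tocsr()
y = df["Col56"].to_numpy()

# SVC accepts sparse input directly.
model = SVC()
model.fit(X, y)
predictions = model.predict(X)
```

With several text columns, each one would presumably get its own CountVectorizer and its output would be appended to the list passed to sparse.hstack. When splitting into train and test sets, the vectorizer should be fitted on the training text only (fit_transform) and then applied to the test text with transform, so that both matrices share the same vocabulary columns.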