Cross-validation and pipelines in scikit-learn

Asked: 2015-04-16 09:24:16

Tags: python machine-learning scikit-learn classification

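For a machine learning project, I am trying to predict a categorical outcome variable from features extracted from text.

I split X and Y into a training set and a test set, and I fit a pipeline on the training set. However, when I compute the performance on the X from my test set, the accuracy is 0.0. Note that I have not extracted any features from X_test myself at this point.

Is it possible to split the dataset inside the pipeline?

My code: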

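from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
# In scikit-learn releases before 0.18 this lived in sklearn.cross_validation
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# read_data is my own helper (not shown); it returns the texts and the targets
X, Y = read_data('development2.csv')

# Hold out a third of the data for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

train_pipeline = Pipeline([('vect', CountVectorizer()), #ngram_range=(1,2), analyzer='word'
                 ('tfidf', TfidfTransformer(use_idf=False)),
                 ('clf', OneVsRestClassifier(SVC(kernel='linear', probability=True))),
                 ])

train_pipeline.fit(X_train, Y_train)

# predict() runs X_test through the fitted vect and tfidf steps before classifying
predicted = train_pipeline.predict(X_test)

print(accuracy_score(Y_test, predicted))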

Traceback when using SVC:

  File "/Users/Robbert/Documents/pipeline.py", line 62, in <module>
    train_pipeline.fit(X_train, Y_train)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 138, in fit
    y = self._validate_targets(y)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 441, in _validate_targets
    y_ = column_or_1d(y, warn=True)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 319, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (670, 5)

1 answer:

Answer 0 (score: 0)

I solved the problem.

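The target variable (Y) was not in the right format. It was stored as one indicator row per sample, like this: [[0 0 0 0 1],[0 0 1 0 0]], which is the two-dimensional (670, 5) shape that the ValueError complains about. I converted it to a one-dimensional array of class labels, like this: [5, 3]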
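For illustration, here is a minimal sketch of that conversion. It assumes Y is a NumPy array with exactly one 1 per row and that the classes are numbered from 1, which matches the [5, 3] example above. The variable names are illustrative and do not come from the original script.

```python
import numpy as np

# Two example rows in the indicator (one-hot) layout described above
Y_indicator = np.array([[0, 0, 0, 0, 1],
                        [0, 0, 1, 0, 0]])

# argmax(axis=1) returns the column index of the single 1 in each row;
# adding 1 turns the 0-based index into a 1-based class label (an assumption)
Y_labels = Y_indicator.argmax(axis=1) + 1

print(Y_labels.shape)     # one entry per sample, so the array is 1-D
print(Y_labels.tolist())
```

The resulting one-dimensional array can be passed as Y to train_test_split and then to the pipeline's fit method.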

This worked for me.
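As for the features of X_test: a fitted Pipeline applies its transformer steps to whatever is passed to predict, so the test texts do not need separate feature extraction. The sketch below shows this end to end with one-dimensional labels. The toy corpus and labels are made up for illustration, and it assumes a scikit-learn version that provides sklearn.model_selection.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the data returned by read_data()
X = ["good movie", "great film", "bad movie", "awful film",
     "good plot", "great acting", "bad plot", "awful acting"]
Y = [1, 1, 2, 2, 1, 1, 2, 2]  # 1-D class labels, not indicator rows

# Split the raw texts first; stratify keeps both classes in each part
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42, stratify=Y)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', SVC(kernel='linear')),
])

# fit() learns the vocabulary and the classifier from the training texts only
pipeline.fit(X_train, Y_train)

# predict() transforms the raw test texts with the fitted steps, then classifies
predicted = pipeline.predict(X_test)
accuracy = accuracy_score(Y_test, predicted)
print(accuracy)
```

Splitting before fitting, rather than inside the pipeline, keeps the vocabulary and term weights from being influenced by the test texts.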

Thanks for all the answers.