What kind of input does sklearn's SVC score method require?

Asked: 2014-06-30 19:57:43

Tags: python scikit-learn

I'm trying to build a classifier and score its performance. Here is my code:

def svc(train_data, train_labels, test_data, test_labels):
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score
    svc = SVC(kernel='linear')
    svc.fit(train_data, train_labels)
    predicted = svc.predict(test_data)
    actual = test_labels
    score = svc.score(test_data, test_labels)
    print ('svc score')
    print (score)
    print ('svc accuracy')
    print (accuracy_score(predicted, actual))

Now I run the function with:

svc(X, x, Y, y)
X.shape = (1000, 150)
x.shape = (1000,)
Y.shape = (200, 150)
y.shape = (200,)

I get this error:

      6     predicted = svc.predict(test_classed_data)
      7     actual = test_classed_labels
----> 8     score = svc.score(test_classed_data, test_classed_labels)
      9     print ('svc score')
     10     print (score)

local/lib/python3.4/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    289         """
    290         from .metrics import accuracy_score
--> 291         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    292 
    293 

    124     if (y_type not in ["binary", "multiclass", "multilabel-indicator",
    125                        "multilabel-sequences"]):
--> 126         raise ValueError("{0} is not supported".format(y_type))
    127 
    128     if y_type in ["binary", "multiclass"]:

ValueError: continuous is not supported

My test_labels, i.e. y, look like this:

[ 15.5  15.5  15.5  15.5  15.5  15.5  15.5  15.5  15.5  15.5  15.5  20.5
  20.5  20.5  20.5  20.5  20.5  20.5  20.5  20.5  20.5  20.5  25.5  25.5
  25.5  25.5  25.5  25.5  25.5  25.5  25.5  25.5  25.5  30.5  30.5  30.5
  30.5  30.5  30.5  30.5  30.5  30.5  30.5  30.5  35.5  35.5  35.5  35.5
  35.5  35.5  35.5  35.5  35.5  35.5  35.5... ]

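I'm really confused about why SVC doesn't recognize these as discrete labels, when every example I've seen uses a format similar to mine and works. Please help.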

2 Answers:

Answer 0: (score: 5)

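The y passed to both the fit and score functions should consist of integers or strings that represent class labels.

E.g., if you have two classes, "foo" and 1, you can train an SVM like this: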

>>> import numpy as np
>>> from sklearn.svm import SVC
>>> clf = SVC()
>>> X = np.random.randn(10, 4)
>>> y = ["foo"] * 5 + [1] * 5
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Then test its accuracy with:

>>> X_test = np.random.randn(6, 4)
>>> y_test = ["foo", 1] * 3
>>> clf.score(X_test, y_test)
0.5

fit apparently still accepts floating-point values, but it shouldn't, because class labels should not be real-valued.
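
The "continuous" in the question's ValueError is the target type that scikit-learn infers from the label array. The following is a minimal sketch of that check, assuming a scikit-learn version that exposes type_of_target in sklearn.utils.multiclass; the three label lists are made up for illustration and only mimic the values in the question.

```python
from sklearn.utils.multiclass import type_of_target

# Illustrative labels (not the asker's real data).
float_labels = [15.5, 15.5, 20.5, 25.5]        # non-integer floats
int_labels = [15, 15, 20, 25]                   # integers
str_labels = ["15.5", "15.5", "20.5", "25.5"]   # strings

# Non-integer floats are inferred as a regression-style target,
# which is what accuracy_score rejects in the traceback.
print(type_of_target(float_labels))  # continuous
print(type_of_target(int_labels))    # multiclass
print(type_of_target(str_labels))    # multiclass
```

Under this reading, the float labels are treated as a continuous target, while the same values expressed as integers or strings are treated as classes.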

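Answer 1: (score: 1)

From the scikit-learn documentation on SVMs, at http://scikit-learn.org/stable/modules/svm.html#classification:

"As other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array Y of integer values"

Convert your label array to int or, if that is too crude (e.g. 1.6 and 1.8 would be converted to the same value), assign an integer class label to each unique float value.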
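
One way to assign an integer class label to each unique float value might look like the sketch below. It is plain Python with no dependencies; encode_labels is a hypothetical helper name introduced here for illustration, and the sample values are invented.

```python
def encode_labels(labels):
    """Map each distinct label value to an integer class index.

    Hypothetical helper: returns the sorted distinct values and the
    labels rewritten as indices into that sorted list.
    """
    classes = sorted(set(labels))
    index_of = {value: i for i, value in enumerate(classes)}
    encoded = [index_of[value] for value in labels]
    return classes, encoded


# Invented sample; 1.6 and 1.8 would both collapse to 1 under int().
float_labels = [15.5, 15.5, 20.5, 25.5, 20.5, 1.6, 1.8]
classes, encoded = encode_labels(float_labels)

print(classes)   # [1.6, 1.8, 15.5, 20.5, 25.5]
print(encoded)   # [2, 2, 3, 4, 3, 0, 1]

# The original values can be recovered from the integer labels.
decoded = [classes[i] for i in encoded]
```

The encoded list can be passed as the labels to fit and score, and predictions can be mapped back through classes. With NumPy arrays, np.unique(labels, return_inverse=True) should produce the same pair of results.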

I'm not sure why the fit and predict methods don't throw an error.