Shouldn't an SVM binary classifier learn the threshold from the training set?

Time: 2016-04-04 14:53:10

Tags: apache-spark classification svm libsvm

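I'm very confused about SVM classifiers, and I apologize if this sounds stupid. I'm using the Spark library for Java (http://spark.apache.org/docs/latest/mllib-linear-methods.html), specifically the first example in the Linear Support Vector Machines section, on this training set: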

1 1:10
1 1:9
1 1:9
1 1:9
0 1:1
1 1:8
1 1:8
0 1:2
0 1:2
0 1:3

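The predictions for the values 8, 2 and 1 are all positive (1). Given the training set, I would expect them to be positive, negative and negative. The model only predicts negative for 0 or for negative values. I read that the standard threshold labels a point "positive" if the prediction is a positive double and "negative" if it is a negative one, and I have seen that there is a way to set the threshold manually. But isn't that exactly why I need a binary classifier? I mean, if I knew in advance what the threshold was, I could separate positive from negative values myself, so why train a classifier at all?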
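
To make the default rule described above concrete, here is a minimal Python sketch; it is not Spark code. It assumes a single feature and a hypothetical weight of 0.3, and the names `raw_score` and `predict` are illustrative only. With the threshold at 0 and no intercept, the decision boundary can only sit at x = 0, which would explain why every positive input is labelled 1.

```python
# Illustrative sketch only: the weight and intercept values are hypothetical,
# not values learned by Spark.
def raw_score(x, weight, intercept=0.0):
    """Linear SVM margin for a single 1-D feature: w * x + b."""
    return weight * x + intercept

def predict(x, weight, intercept=0.0, threshold=0.0):
    """Label 1 when the raw score exceeds the threshold, else 0."""
    return 1 if raw_score(x, weight, intercept) > threshold else 0

w = 0.3  # hypothetical positive weight, model trained without an intercept
inputs = (8, 2, 1, 0, -1)

# With b = 0 the boundary sits at x = 0, so every positive x is labelled 1.
no_intercept = [predict(x, w) for x in inputs]

# A negative b moves the boundary to x = -b / w = 5.
with_intercept = [predict(x, w, intercept=-1.5) for x in inputs]

print(no_intercept)
print(with_intercept)
```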

Update: using this Python code from a different library:

X = [[10], [9], [9], [9], [1], [8], [8], [2], [2], [3]]
y = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np

# convert the lists of lists to numpy arrays
X = np.array(X)
y = np.array(y)
# collect the accuracy of each fold
accuracy = []

# 5-fold stratified cross-validation, so every sample is used for testing once
kf_total = StratifiedKFold(n_splits=5, shuffle=True)
for train, test in kf_total.split(X, y):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    print(X_train)
    clf = SVC().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("the classifier says: ", y_pred)
    print("reality is:          ", y_test)
    print(accuracy_score(y_test, y_pred))
    print("")
    accuracy.append(accuracy_score(y_test, y_pred))

# mean accuracy over all folds
print(sum(accuracy) / len(accuracy))

The results are correct:

######
1 [0]
######
2 [0]
######
8 [1]

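So I think an SVM classifier can work out the threshold by itself; how can I do the same with the Spark library?

Solved: I fixed the problem by changing the example to this: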

SVMWithSGD std = new SVMWithSGD();
std.setIntercept(true);
final SVMModel model = std.run(training.rdd());

from this:

final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);

The default value of the intercept setting is false, and I needed it to be true.

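Why the intercept matters can be sketched with a simplified hinge-loss SGD in Python. This is a toy under stated assumptions (fixed step size, no regularization, and made-up names `train` and `predict`), not Spark's actual optimizer, but it reproduces the same effect on the training set from the question.

```python
# Simplified hinge-loss SGD for a single feature. Teaching sketch only:
# fixed step size and no regularization, unlike Spark's SVMWithSGD.
XS = [10, 9, 9, 9, 1, 8, 8, 2, 2, 3]
YS = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

def train(xs, ys, use_intercept, step=0.1, epochs=1000):
    """Return (w, b); b stays 0.0 when use_intercept is False."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, label in zip(xs, ys):
            y = 1 if label == 1 else -1   # hinge loss uses labels in {-1, +1}
            if y * (w * x + b) < 1:       # inside the margin or misclassified
                w += step * y * x
                if use_intercept:
                    b += step * y
    return w, b

def predict(x, w, b):
    """Label 1 when the raw score w * x + b is positive, else 0."""
    return 1 if w * x + b > 0 else 0

w0, b0 = train(XS, YS, use_intercept=False)
w1, b1 = train(XS, YS, use_intercept=True)

without_intercept = [predict(x, w0, b0) for x in (8, 2, 1)]
with_intercept = [predict(x, w1, b1) for x in (8, 2, 1)]

print("no intercept:  ", without_intercept)
print("with intercept:", with_intercept)
```

Without the intercept the boundary is pinned at x = 0, so 8, 2 and 1 all fall on the positive side. With it, the boundary is free to move into the gap between 3 and 8.
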
1 answer:

Answer 0 (score: 0)

If you search for probability calibration you will find some research on a related matter (recalibrating the outputs to return better scores).

If your problem is a binary classification problem, you can calculate the slope of the cost line by assigning values (costs) to the true/false positive/negative outcomes and multiplying by the class ratio. You can then find the line with that slope that touches the ROC curve at exactly one point; that point is, in some sense, the optimal threshold for your problem.
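
The cost-based idea above can be sketched in plain Python. The two costs are hypothetical, the scores are simply the raw feature values from the question's training set, and the names `roc_point` and `total_cost` are made up for illustration. Minimizing the total cost over candidate thresholds picks the same ROC point that the tangent line with the iso-cost slope would touch.

```python
# Sketch: choose the score threshold with the lowest misclassification cost.
# COST_FP and COST_FN are hypothetical; pick values that fit your problem.
scores = [10, 9, 9, 9, 1, 8, 8, 2, 2, 3]
labels = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
COST_FP = 1.0  # hypothetical cost of one false positive
COST_FN = 1.0  # hypothetical cost of one false negative

n_pos = sum(labels)
n_neg = len(labels) - n_pos

# Slope of the iso-cost lines in ROC space (TPR plotted against FPR).
iso_cost_slope = (COST_FP * n_neg) / (COST_FN * n_pos)

def roc_point(threshold):
    """Return (FPR, TPR) when predicting 1 for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return fp / n_neg, tp / n_pos

def total_cost(threshold):
    """Cost of all false positives plus all false negatives."""
    fpr, tpr = roc_point(threshold)
    return COST_FP * fpr * n_neg + COST_FN * (1 - tpr) * n_pos

# Every distinct score is a candidate threshold.
candidates = sorted(set(scores))
best_threshold = min(candidates, key=total_cost)

print("iso-cost slope:", iso_cost_slope)
print("best threshold:", best_threshold, "at ROC point", roc_point(best_threshold))
```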

The threshold is a single value that separates the classes.