Scikit-learn learning curve with different scorers and Leave One Group Out CV yields the same values

Date: 2016-12-04 21:02:30

Tags: python machine-learning scikit-learn svm

I want to plot learning curves for a trained SVM classifier, using different scorers, with Leave One Group Out as the cross-validation method. I thought I had it figured out, but two different scorers - 'f1_micro' and 'accuracy' - produce exactly the same values. I am confused: is it supposed to be that way?

Here is my code (unfortunately I cannot share the data, since it is not open):

import numpy as np
import pandas as pd
from sklearn import preprocessing, svm
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, validation_curve

SVC_classifier_LOWO_VC0 = svm.SVC(cache_size=800, class_weight=None,
    coef0=0.0, decision_function_shape=None, degree=3, gamma=0.01,
    kernel='rbf', max_iter=-1, probability=False, random_state=1,
    shrinking=True, tol=0.001, verbose=False)
training_data = pd.read_csv('training_data.csv')
X = training_data.drop(['Groups', 'Targets'], axis=1).values
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
y = training_data['Targets'].values
groups = training_data["Groups"].values
Fscorer = make_scorer(f1_score, average='micro')   # micro-averaged F1 scorer
logo = LeaveOneGroupOut()
parm_range0 = np.logspace(-2, 6, 9)
train_scores0, test_scores0 = validation_curve(SVC_classifier_LOWO_VC0, X,
    y, "C", parm_range0, cv=logo.split(X, y, groups=groups), scoring=Fscorer)

Now, from:

train_scores_mean0 = np.mean(train_scores0, axis=1)
train_scores_std0 = np.std(train_scores0, axis=1)
test_scores_mean0 = np.mean(test_scores0, axis=1)
test_scores_std0 = np.std(test_scores0, axis=1)
print test_scores_mean0
print np.amax(test_scores_mean0)
print np.logspace(-2, 6, 9)[test_scores_mean0.argmax(axis=0)]

I get:

[0.20257407 0.35551122 0.40791047 0.49887676 0.5021742  0.50030438
 0.49426622 0.48066419 0.4868987]

0.502174200206

100.0

If I create a new classifier, but with the same parameters, and run everything exactly as before, except for the scorer - e.g. 'accuracy' instead of the micro-averaged F1 scorer:
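(The second run is not included in the question as captured here; the sketch below is only a guess at what "everything as before, except for the scorer" might look like - the classifier name SVC_classifier_LOWO_VC1 and the built-in 'accuracy' scoring string are assumptions.)

SVC_classifier_LOWO_VC1 = svm.SVC(cache_size=800, class_weight=None,
    coef0=0.0, decision_function_shape=None, degree=3, gamma=0.01,
    kernel='rbf', max_iter=-1, probability=False, random_state=1,
    shrinking=True, tol=0.001, verbose=False)
# Same data, same Leave One Group Out splits; only the scoring argument changes
train_scores1, test_scores1 = validation_curve(SVC_classifier_LOWO_VC1, X,
    y, "C", parm_range0, cv=logo.split(X, y, groups=groups),
    scoring='accuracy')
test_scores_mean1 = np.mean(test_scores1, axis=1)
print(test_scores_mean1)
print(np.amax(test_scores_mean1))
print(np.logspace(-2, 6, 9)[test_scores_mean1.argmax(axis=0)])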

I get exactly the same answer:

[0.20257407 0.35551122 0.40791047 0.49887676 0.5021742  0.50030438
 0.49426622 0.48066419 0.4868987]

0.502174200206

100.0

How can this be? Am I doing something wrong, or missing something?

Thanks

1 answer:

Answer 0 (score: 1)

F1 = accuracy if and only if TP = TN, i.e. the number of true positives equals the number of true negatives, which happens if your classes are perfectly balanced. So either that is the case, or there is a bug in your code. Where do you initialize your scorer - as follows: scorer = make_scorer(accuracy_score, average = 'micro')?
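A quick, self-contained check of the TP = TN case described above, using small hypothetical label vectors rather than the question's data:

from sklearn.metrics import accuracy_score, f1_score

# Hypothetical binary labels with TP = TN = 3, FP = 2, FN = 1
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))        # 2*3 / (2*3 + 2 + 1) = 0.666...
print(accuracy_score(y_true, y_pred))  # (3 + 3) / 9 = 0.666...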