How do I compute the correct cross-validation scores in scikit-learn?

Date: 2016-08-02 04:45:41

Tags: python python-3.x machine-learning scikit-learn

I am working on a classification task. However, my results come out slightly different depending on how I compute the score:


My code:

#First Approach
kf = KFold(n=len(y), n_folds=10, shuffle=True, random_state=False)
pipe= make_pipeline(SVC())
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

print ('Precision',np.mean(cross_val_score(pipe, X_train, y_train, scoring='precision')))



#Second Approach
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print ('Precision:', precision_score(y_test, y_pred,average='binary'))

#Third approach
pipe= make_pipeline(SVC())
print('Precision',np.mean(cross_val_score(pipe, X, y, cv=kf, scoring='precision')))

#Fourth approach

pipe= make_pipeline(SVC())
print('Precision',np.mean(cross_val_score(pipe, X_train, y_train, cv=kf, scoring='precision')))

So, my question is: which of the approaches above is the correct way to compute cross-validated metrics? I think my scores are contaminated, because I am confused about when exactly to perform cross-validation. So, any ideas on how to compute cross-validation scores correctly?

Update

Evaluating in the training step?

Precision: 0.780422106837
Precision: 0.782051282051
Precision: 0.801544091998

/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1431                                               train, test, verbose, None,
   1432                                               fit_params)
-> 1433                       for train, test in cv)
   1434     return np.array(scores)[:, 0]
   1435 

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    798             # was dispatched. In particular this covers the edge
    799             # case of Parallel used with an exhausted iterator.
--> 800             while self.dispatch_one_batch(iterator):
    801                 self._iterating = True
    802             else:

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    656                 return False
    657             else:
--> 658                 self._dispatch(tasks)
    659                 return True
    660 

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    564 
    565         if self._pool is None:
--> 566             job = ImmediateComputeBatch(batch)
    567             self._jobs.append(job)
    568             self.n_dispatched_batches += 1

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
    178         # Don't delay the application, to avoid keeping the input
    179         # arguments in memory
--> 180         self.results = batch()
    181 
    182     def get(self):

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
     70 
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73 
     74     def __len__(self):

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
     70 
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73 
     74     def __len__(self):

/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1522     start_time = time.time()
   1523 
-> 1524     X_train, y_train = _safe_split(estimator, X, y, train)
   1525     X_test, y_test = _safe_split(estimator, X, y, test, train)
   1526 

/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in _safe_split(estimator, X, y, indices, train_indices)
   1589                 X_subset = X[np.ix_(indices, train_indices)]
   1590         else:
-> 1591             X_subset = safe_indexing(X, indices)
   1592 
   1593     if y is not None:

/usr/local/lib/python3.5/site-packages/sklearn/utils/__init__.py in safe_indexing(X, indices)
    161                                    indices.dtype.kind == 'i'):
    162             # This is often substantially faster than X[indices]
--> 163             return X.take(indices, axis=0)
    164         else:
    165             return X[indices]

IndexError: index 900 is out of bounds for size 900

2 answers:

Answer 0 (score: 4):

For any classification task, it is always good to use the StratifiedKFold cross-validation split. In stratified KFold, each fold preserves the same proportion of samples from each class as your full dataset.

[link](http://nl.mathworks.com/help/examples/matlab/PlotMultipleHistogramsExample_01.png)
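As a rough sketch of what stratification buys you, here is a toy example with synthetic imbalanced data. It uses the modern `sklearn.model_selection` API (the old `sklearn.cross_validation` module used elsewhere in this thread was later removed):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy problem: 90 samples of class 0, 10 of class 1
X = np.random.rand(100, 5)
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    # Every test fold keeps the 9:1 class ratio of the full dataset
    print(np.bincount(y[test_index]))  # [18  2] on each of the 5 folds
```

A plain KFold on the same data can easily produce folds where the minority class is missing entirely, which makes per-fold precision and recall meaningless.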

Then it depends on the type of your classification problem. It is always nice to look at the precision and recall scores. In the case of a skewed binary classification, people tend to use the ROC AUC score:

from sklearn import metrics
metrics.roc_auc_score(ytest, ypred)
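Note that `roc_auc_score` is most informative when given continuous scores rather than hard 0/1 predictions. A minimal sketch on synthetic data (the variable names and the use of `probability=True` on SVC are illustrative choices, not from the original post):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.randint(0, 2, 200)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# probability=True makes SVC expose predict_proba (at some extra training cost)
clf = SVC(probability=True, random_state=0).fit(Xtr, ytr)
auc = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])
print(auc)  # a value between 0 and 1
```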

Let's look at your solutions:

import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import precision_score
from sklearn.cross_validation import KFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

np.random.seed(1337)

X = np.random.rand(1000,5)

y = np.random.randint(0,2,1000)

kf = KFold(n=len(y), n_folds=10, shuffle=True, random_state=42)
pipe= make_pipeline(SVC(random_state=42))
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

print ('Precision',np.mean(cross_val_score(pipe, X_train, y_train, scoring='precision')))
# Here you are evaluating precision score on X_train.

#Second Approach
clf = SVC(random_state=42)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print ('Precision:', precision_score(y_test, y_pred, average='binary'))

# here you are evaluating precision score on X_test

#Third approach
pipe= make_pipeline(SVC())
print('Precision',np.mean(cross_val_score(pipe, X, y, cv=kf, scoring='precision')))

# Here you are splitting the data again and evaluating mean on each fold

Hence, the results are different.

Answer 1 (score: 3):

First of all, as explained in the documentation and shown in some examples, scikit-learn's cross-validation function cross_val_score does the following:

  1. Splits the dataset X into N folds (according to the parameter cv), and splits the labels y accordingly.
  2. Trains the estimator (parameter estimator) on N-1 of those folds.
  3. Uses that estimator to predict the labels of the remaining fold.
  4. Computes a score by comparing the predicted values with the true values.
  5. Returns that score (parameter scoring).
  6. Repeats steps 2 to 4 with a different test fold each time, so you end up with an array of N scores.

Let's look at each of your approaches.
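The steps above can be sketched as an explicit loop. On synthetic data, and with the modern `sklearn.model_selection` API, the manual loop reproduces cross_val_score exactly (default scoring is the estimator's own score method, i.e. accuracy for SVC):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(1337)
X = rng.rand(200, 5)
y = rng.randint(0, 2, 200)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
manual_scores = []
for train_index, test_index in kf.split(X):            # step 1: split
    clf = SVC(random_state=42)
    clf.fit(X[train_index], y[train_index])            # step 2: train on N-1 folds
    manual_scores.append(clf.score(X[test_index], y[test_index]))  # steps 3-5

# cross_val_score runs the same loop internally
auto_scores = cross_val_score(SVC(random_state=42), X, y, cv=kf)
print(np.allclose(manual_scores, auto_scores))  # True
```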

    First approach:

    Why are you splitting the training set before cross-validation, when the scikit-learn function does it for you? This way you train your model on less data, and the validation score you end up with is not the one you were after.

    Second approach:

    Here you are evaluating a different metric on your data than cross_val_score does. So you cannot compare it with your other validation scores, because they are two different things. One is a classic error percentage, while precision is a metric used to assess binary classifiers (true or false). It is a good metric (you can also look at ROC curves, and the precision and recall metrics), but compare such metrics only with each other.

    Third approach:

    This one is more natural. This score is the right one to use (I mean, if you want to compare it with other classifiers/estimators). However, I would warn you against taking the mean directly: there are two things you can compare, the mean and the variance. Each score in the array differs from the others, and you may want to know by how much compared with other estimators (you certainly want your variance to be as small as possible).

    Fourth approach:

    There seems to be an issue with the KFold here, unrelated to cross_val_score.
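The IndexError in the update is consistent with this: the old sklearn.cross_validation.KFold stored the dataset size at construction (n=len(y)=1000), so its folds contain indices up to 999 even when cross_val_score is later handed the 900-sample X_train. A sketch of the same mismatch, reconstructed with the modern API (which generates splits lazily via .split() and therefore avoids this trap when used as intended):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(1000, 5)
X_sub = X[:900]                      # a 90% training subset

kf = KFold(n_splits=10, shuffle=True, random_state=42)
full_splits = list(kf.split(X))      # indices drawn from 0..999

failures = 0
for train_index, _ in full_splits:
    try:
        X_sub[train_index]           # reusing 1000-sample folds on 900 rows
    except IndexError:
        failures += 1
print(failures)  # every fold references rows >= 900, so indexing fails
```

The fix is to let the splitter generate folds from the data you actually score on: with the modern API, cross_val_score(pipe, X_sub, y_sub, cv=kf) calls kf.split(X_sub) internally and produces indices 0..899 only.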

    Finally:

    Use only the second OR the third approach to compare estimators. But they definitely do not estimate the same thing: precision versus error rate.

    clf = make_pipeline(SVC())
    # However, for clf, you can use whatever estimator you like
    scores = cross_val_score(clf, X, y, cv = 10, scoring='precision')
    print('Mean score : ', np.mean(scores))
    print('Score variance : ', np.var(scores))
    

    By changing clf to another estimator (or by iterating over several estimators in a loop), you can get a score for each one and compare them.
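One way to put that advice into practice, assuming a few common scikit-learn estimators as stand-ins for whatever models you actually want to compare (synthetic data, modern `sklearn.model_selection` API):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1337)
X = rng.rand(300, 5)
y = rng.randint(0, 2, 300)

estimators = {
    'SVC': SVC(random_state=42),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
}
for name, clf in estimators.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring='precision')
    # Report both the mean and the variance, as the answer suggests
    print(f'{name}: mean={scores.mean():.3f}, var={scores.var():.5f}')
```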