Here is a big chunk of my Python (2.7 — I learned Python 3, so I use the __future__ print_function to get the print formatting I'm used to) that uses scikit-learn code from several versions back; corporate IT policy has me stuck there. It uses the SVC engine. What I don't understand is that the +/- 1 results I get differ between the first pass (using simple_clf) and the second. Structurally, I believe they are the same, except that the first pass processes the whole data array at once while the second feeds the data one item at a time — yet the results don't agree. The values generated for the average (mean) scores should be decimal fractions (0.0 to 1.0). In some cases the difference is small, but in others it is large enough to make me ask this question.
from __future__ import print_function
import os
import numpy as np
from numpy import array, loadtxt
from sklearn import cross_validation, datasets, svm, preprocessing, grid_search
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_score
GRADES = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'M']
# Initial processing
# FEATUREVECFILE and SCORESFILE are path constants defined elsewhere in the script
featurevecs = loadtxt( FEATUREVECFILE )
f = open( SCORESFILE )
scorelines = f.readlines()[ 1: ] # Skip the header line
f.close()
scorenums = [ GRADES.index( l.split( '\t' )[ 1 ] ) for l in scorelines ]
scorenums = array( scorenums )
# Need this step to normalize the feature vectors
# preprocessing.Scaler is the old name for what later became StandardScaler
scaler = preprocessing.Scaler()
scaler.fit( featurevecs )
featurevecs = scaler.transform( featurevecs )
# Break up the vector into a training and testing vector
# Need to keep the training set somewhat large to get enough of the
# scarce results in the training set or the learning fails
X_train, X_test, y_train, y_test = train_test_split(
featurevecs, scorenums, test_size = 0.333, random_state = 0 )
# Define a range of parameters we can use to do a grid search
# for the 'best' ones.
CLFPARAMS = {'gamma':[.0025, .005, 0.09, .01, 0.011, .02, .04],
'C':[200, 300, 400, 500, 600]}
# do a simple cross validation
simple_clf = svm.SVC()
simple_clf = grid_search.GridSearchCV( simple_clf, CLFPARAMS, cv = 3 )
simple_clf.fit( X_train, y_train )
y_true, y_pred = y_test, simple_clf.predict( X_test )
match = 0
close = 0
count = 0
deviation = []
for i in range( len( y_true ) ):
    count += 1
    delta = np.abs( y_true[ i ] - y_pred[ i ] )
    if delta == 0:
        match += 1
    elif delta == 1:
        close += 1
    # Append 1.0 if this prediction is within one grade of the truth, else 0.0
    deviation = np.append( deviation,
                           float( np.sum( np.abs( delta ) <= 1 ) ) )
avg = float( match ) / float( count )
close_avg = float( close ) / float( count )
# Note: deviation.mean() == avg + close_avg
# test_type is a label string defined elsewhere (one per feature combination)
print( '{0} Accuracy (+/- 0) {1:0.4f} Accuracy (+/- 1) {2:0.4f} (+/- {3:0.4f}) '.format(
    test_type, avg, deviation.mean(), deviation.std() / 2.0 ), end = "" )
# "Original" code
# do LeaveOneOut item by item
clf = svm.SVC()
clf = grid_search.GridSearchCV( clf, CLFPARAMS, cv = 3 )
toleratePara = 1
thecurrentScoreGraded = np.array( [] )
loo = cross_validation.LeaveOneOut( n = len( featurevecs ) )
for train, test in loo:
    try:
        clf.fit( featurevecs[ train ], scorenums[ train ] )
        rawPredictionResult = clf.predict( featurevecs[ test ] )
        errorVec = scorenums[ test ] - rawPredictionResult
        print( len( errorVec ), errorVec )
        # Fraction of this one-element test fold within the tolerance (so 0.0 or 1.0)
        thecurrentScoreGraded = np.append( thecurrentScoreGraded,
            float( np.sum( np.abs( errorVec ) <= toleratePara ) ) / len( errorVec ) )
    except ValueError:
        pass
print( '{0} Accuracy (+/- {1:d}) {2:0.4f} (+/- {3:0.4f})'.format( test_type, toleratePara, thecurrentScoreGraded.mean(), thecurrentScoreGraded.std() / 2 ) )
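As an aside, the "+/- k" accuracy that both loops above compute can be written as one vectorized NumPy expression; this is just a sketch of the same metric, not part of my script (the function name is mine):

```python
import numpy as np

def within_k_accuracy(y_true, y_pred, k=1):
    """Fraction of predictions within k grade steps of the true grade."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred) <= k))

# k=0 reduces to plain exact-match accuracy.
```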
Below are my results, and you can see that they don't match. My actual work task is to see whether changing the type of data collected for the learning engine helps accuracy, or whether combining the data into larger teaching vectors helps, which is why I am working through a bunch of combinations. Each pair of lines is for one kind of learning data: the first line is my result, the second is the result from the "original" code.
original Accuracy (+/- 0) 0.2771 Accuracy (+/- 1) 0.6024 (+/- 0.2447)
original Accuracy (+/- 1) 0.6185 (+/- 0.2429)
upostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384)
upostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398)
npostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384)
npostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398)
tancurv Accuracy (+/- 0) 0.2330 Accuracy (+/- 1) 0.5825 (+/- 0.2466)
tancurv Accuracy (+/- 1) 0.5831 (+/- 0.2465)
npostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199)
npostan Accuracy (+/- 1) 0.7003 (+/- 0.2291)
nposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466)
nposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453)
upostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199)
upostan Accuracy (+/- 1) 0.7003 (+/- 0.2291)
uposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466)
uposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453)
upos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293)
upos Accuracy (+/- 1) 0.6450 (+/- 0.2393)
npos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293)
npos Accuracy (+/- 1) 0.6450 (+/- 0.2393)
curv Accuracy (+/- 0) 0.1553 Accuracy (+/- 1) 0.4854 (+/- 0.2499)
curv Accuracy (+/- 1) 0.5570 (+/- 0.2484)
tan Accuracy (+/- 0) 0.3107 Accuracy (+/- 1) 0.7184 (+/- 0.2249)
tan Accuracy (+/- 1) 0.7231 (+/- 0.2237)
Answer 0 (score: 0)
What do you mean by "structurally they are identical"? You train and test on different subsets, and those subsets have different sizes. If you don't train on exactly the same data, I don't see why you would expect the results to be the same.
Still, as an aside, have a look at the note on LOO in the documentation. LOO can have high variance.
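The gap between the two numbers is easy to reproduce. This is a sketch using the modern scikit-learn names (the old cross_validation and grid_search modules were later merged into model_selection) and the bundled iris data instead of the asker's files; the two printed accuracies come from different training sets, so they generally differ:

```python
from sklearn import datasets, svm
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC()

# A single fixed 2/3 - 1/3 split yields one accuracy number.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.333, random_state=0)
split_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# LeaveOneOut yields len(y) scores, each 0.0 or 1.0, which are averaged.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print('split: %.4f  LOO mean: %.4f' % (split_acc, loo_scores.mean()))
```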