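I am developing a program in which I have some data (labeled and unlabeled) and 2 different groups ("artritis" and "fibro"). I want to obtain the accuracy of the classifier and then classify the unlabeled data. My problem is that I am testing it with 2 classifiers (LDA and QDA). With the first one I get 81% accuracy, and when I classify the unlabeled data (39 objects), it classifies everything correctly. However, when I use QDA I get 93.74% accuracy, and when it classifies the unlabeled data (the same 39 objects), it labels 3 of them with the wrong group. Can someone help me find my error?
My code: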
#"listaTrain" has a list of dictionaries which are the labeled data and will be used for
# training and Cross-Validation
#"listaLabels" has a list of the train labels
#"listaClasificar" has a list of dictionaries which are the unlabeled data
# which I want to label
#"clasificador" is my classifier
X=vec.fit_transform(listaTrain) #I transform the dictionaries to
#a format that sklearn can use
X=preprocessing.scale(X.toarray()) #I scale the values
clasificador.fit(X, listaLabels) #I train the classifier with the train data and
# the train labels
n_samples = X.shape[0]
cv = cross_validation.ShuffleSplit(n_samples, n_iter=300, test_size=0.6, random_state=4)
#I make Cross-Validation dividing the X's data (40% for training and 60% for testing)
scores = cross_validation.cross_val_score(clasificador, X, listaLabels, cv=cv)
#I obtain the Cross-validation accuracy
scores.mean() #I obtain the mean accuracy (this is where I get 81% and 93%)
testX=vec.transform(listaClasificar) #I transform the dictionaries to a
#format that sklearn can use
testX=preprocessing.scale(testX.toarray()) #I scale the values
predicted=clasificador.predict(testX) #I predict the labels of the unlabeled data
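
One detail that may be worth checking: the code above scales the training data and the unlabeled data separately, so each set is centered and scaled with its own mean and standard deviation. The sketch below only illustrates that difference with made-up numbers; `fit_scaler` and `apply_scaler` are hypothetical helpers written for this example, not part of my program or of sklearn, and I have not verified that this explains the 3 mislabeled objects.

```python
import statistics

def fit_scaler(rows):
    """Return per-column means and standard deviations for the given rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Population standard deviation; a constant column falls back to 1.0
    # so that the division below never divides by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_scaler(rows, means, stds):
    """Center and scale every row with the supplied statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Made-up numbers, purely for illustration (not the real data).
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
unlabeled = [[4.0, 40.0], [5.0, 50.0]]

# Option A: reuse the statistics learned from the training data.
means, stds = fit_scaler(train)
scaled_with_train_stats = apply_scaler(unlabeled, means, stds)

# Option B: scale the unlabeled data on its own, as the code above does.
scaled_on_its_own = apply_scaler(unlabeled, *fit_scaler(unlabeled))

print(scaled_with_train_stats)
print(scaled_on_its_own)
```

The two results differ, which means the classifier would see the unlabeled objects in a different feature space than the one it was trained on. In sklearn, the equivalent of option A would be fitting a `preprocessing.StandardScaler` on the training data and calling its `transform` method on the unlabeled data.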