Scikit-learn - How to use cross-validation correctly

Asked: 2015-07-24 18:54:48

Tags: python scikit-learn cross-validation supervised-learning

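I'm developing a program in which I have some data (labeled and unlabeled) and 2 different groups ("artritis" and "fibro"). I want to obtain the classifier's accuracy and then classify the unlabeled data. My problem is that I'm testing it with 2 classifiers (LDA and QDA). With the first, I obtain 81% accuracy, and when I classify the unlabeled data (39 objects) it classifies everything correctly. However, with QDA I obtain 93.74% accuracy, yet when it classifies the unlabeled data (the same 39 objects) it labels 3 of them with the wrong group. Can someone help me find my error?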

My code:

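    # Uses the pre-0.20 scikit-learn API (sklearn.cross_validation).
    from sklearn import cross_validation, preprocessing

    # "listaTrain" is a list of dictionaries holding the labeled data, used for
    # training and cross-validation
    # "listaLabels" is the list of training labels
    # "listaClasificar" is a list of dictionaries holding the unlabeled data
    # that I want to label
    # "clasificador" is my classifier
    # "vec" is not defined in this snippet; presumably a DictVectorizer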

    # Transform the dictionaries into a format that sklearn can use
    X = vec.fit_transform(listaTrain)
    X = preprocessing.scale(X.toarray())  # Scale the values

    # Train the classifier on the training data and the training labels
    clasificador.fit(X, listaLabels)
    n_samples = X.shape[0]
    # Cross-validation splits of X (40% for training and 60% for testing)
    cv = cross_validation.ShuffleSplit(n_samples, n_iter=300, test_size=0.6, random_state=4)
    # Cross-validation accuracy of each split (the keyword is "cv", not "v")
    scores = cross_validation.cross_val_score(clasificador, X, listaLabels, cv=cv)
    scores.mean()  # Mean accuracy (this is where I obtain 81% and 93%)

    # Transform the dictionaries into a format that sklearn can use
    testX = vec.transform(listaClasificar)
    testX = preprocessing.scale(testX.toarray())  # Scale the values

    predicted = clasificador.predict(testX)  # Predict the labels of the unlabeled data
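
For comparison, here is a minimal sketch of the same workflow written against the newer `sklearn.model_selection` API. It is not the code from the question: the toy dictionaries, the feature names `pain` and `stiffness`, and the helper `build_model` are all made up for illustration, and LDA is used only as an example classifier. The one deliberate difference is that scaling lives inside a `Pipeline`, so the mean and variance learned from the training rows are reused for the unlabeled rows. Calling `preprocessing.scale` separately on each set standardizes them with different statistics. Whether that accounts for the discrepancy described above is not established here; it is simply something worth checking.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_model():
    # Hypothetical helper: vectorize dicts, standardize, then classify.
    # sparse=False makes DictVectorizer return a dense array directly.
    return make_pipeline(
        DictVectorizer(sparse=False),
        StandardScaler(),
        LinearDiscriminantAnalysis(),
    )


# Synthetic stand-in data (NOT the data from the question):
# two well-separated groups with two made-up features each.
rng = np.random.RandomState(4)
lista_train, lista_labels = [], []
for label, center in (("artritis", 0.0), ("fibro", 3.0)):
    for _ in range(40):
        lista_train.append({
            "pain": float(center + rng.normal()),
            "stiffness": float(center + rng.normal()),
        })
        lista_labels.append(label)

lista_clasificar = [
    {"pain": 0.1, "stiffness": -0.2},
    {"pain": 3.2, "stiffness": 2.9},
]

model = build_model()

# Cross-validate first: each split refits the whole pipeline on its
# training portion, so the scaling statistics never see held-out rows.
cv = ShuffleSplit(n_splits=30, test_size=0.6, random_state=4)
scores = cross_val_score(model, lista_train, lista_labels, cv=cv)
print("mean CV accuracy:", scores.mean())

# Then fit on all labeled data and predict the unlabeled rows; the
# pipeline applies the training-set scaling to them.
model.fit(lista_train, lista_labels)
predicted = model.predict(lista_clasificar)
print(predicted)
```

Because the vectorizer and scaler are part of the estimator passed to `cross_val_score`, the reported accuracy measures the same preprocessing that is later applied to the unlabeled data.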

0 answers:

No answers