Question

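I am developing a program in which I have some data (labeled and unlabeled) and 2 different groups ("artritis" and "fibro"). I want to obtain the classifier's accuracy and then classify the unlabeled data. My problem is that I am testing it with 2 classifiers (LDA and QDA). With the first, I get 81% accuracy, and when I classify the unlabeled data (39 objects), it classifies all of them correctly. However, when I use QDA, I get 93.74% accuracy, and when it classifies the unlabeled data (the same 39 objects), it labels 3 of them with the wrong group. Can someone help me find my mistake?

My code: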

    # Written for an older scikit-learn (< 0.20), where the cross_validation module still exists.
    from sklearn import cross_validation, preprocessing

    # "listaTrain" is a list of dictionaries holding the labeled data; it is used for
    # training and cross-validation
    # "listaLabels" is the list of training labels
    # "listaClasificar" is a list of dictionaries holding the unlabeled data
    # that I want to label
    # "clasificador" is my classifier

    X = vec.fit_transform(listaTrain)  # convert the dictionaries to a format sklearn can use
    X = preprocessing.scale(X.toarray())  # scale the values

    clasificador.fit(X, listaLabels)  # train the classifier on the training data and labels
    n_samples = X.shape[0]
    # cross-validation: each split uses 40% of X for training and 60% for testing
    cv = cross_validation.ShuffleSplit(n_samples, n_iter=300, test_size=0.6, random_state=4)
    scores = cross_validation.cross_val_score(clasificador, X, listaLabels, cv=cv)
    scores.mean()  # mean cross-validation accuracy (this is where I get 81% and 93%)

    testX = vec.transform(listaClasificar)  # convert the dictionaries to a format sklearn can use
    testX = preprocessing.scale(testX.toarray())  # scale the values

    predicted = clasificador.predict(testX)  # predict the labels of the unlabeled data
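
One possible cause, offered as an assumption rather than a confirmed diagnosis: `preprocessing.scale` is applied separately to the labeled and the unlabeled data, so each set is standardized with its own mean and variance, and the labeled data is scaled as a whole before cross-validation splits it. The sketch below shows one alternative that keeps vectorizing, scaling and the classifier in a single pipeline, so the scaling statistics are learned from the training portion only and reused for new data. It assumes a recent scikit-learn (where `cross_validation` has become `model_selection`), assumes `vec` is a `DictVectorizer`, and uses made-up data produced by a hypothetical helper `make_samples`; none of these come from the original code.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(4)


def make_samples(center, n):
    """Hypothetical stand-in data: n dictionaries with three numeric features."""
    values = rng.normal(loc=center, scale=1.0, size=(n, 3))
    return [{"f1": a, "f2": b, "f3": c} for a, b, c in values]


# Made-up labeled and unlabeled data (assumption: the real data are lists of dicts).
lista_train = make_samples(0.0, 100) + make_samples(3.0, 100)
lista_labels = ["artritis"] * 100 + ["fibro"] * 100
lista_clasificar = make_samples(0.0, 20) + make_samples(3.0, 19)

# Vectorizer, scaler and classifier are fitted together, so inside each
# cross-validation split the scaler only sees that split's training part.
model = make_pipeline(
    DictVectorizer(sparse=False),
    StandardScaler(),
    QuadraticDiscriminantAnalysis(),
)

# Same split proportions as in the question (40% train, 60% test), fewer repeats.
cv = ShuffleSplit(n_splits=30, test_size=0.6, random_state=4)
scores = cross_val_score(model, lista_train, lista_labels, cv=cv)
print("mean CV accuracy:", scores.mean())

# Fit on all labeled data, then label the unlabeled data; the scaler reuses
# the mean and variance learned from the labeled data.
model.fit(lista_train, lista_labels)
predicted = model.predict(lista_clasificar)
print(predicted[:5])
```

Swapping `QuadraticDiscriminantAnalysis` for `LinearDiscriminantAnalysis` in the pipeline would allow the same comparison between the two classifiers. Whether this changes the 3 mislabeled objects depends on the real data, which the sketch cannot show.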

Scikit-learn - How to use cross-validation correctly

0 answers: