Shuffling increases accuracy - sklearn - MultinomialNaiveBayes

Asked: 2014-02-19 12:19:50

Tags: python machine-learning scikit-learn

I am trying to measure the accuracy of the Multinomial Naive Bayes algorithm in scikit-learn.

Here is the code:

import numpy as np
import random
from sklearn import naive_bayes
from sklearn.preprocessing import LabelBinarizer
from collections import Counter
dim0 = ['high', 'low', 'med', 'vhigh']
dim1 = ['high', 'low', 'med', 'vhigh']
dim2 = ['2', '3', '4', '5more']
dim3 = ['2', '4', 'more']
dim4 = [ 'big', 'med', 'small' ]
dim5 = ['high' , 'low', 'med' ]
target = ['acc', 'good', 'unacc', 'vgood' ]

dimensions = [ dim0, dim1, dim2, dim3, dim4, dim5, target]

# function to read dataset
def readDataSet(fname):
    dataset = []
    # 'with' closes the file even if parsing fails
    with open(fname, 'r') as f:
        for line in f:
            tokenized = line.strip().split(',')
            # skip blank or malformed lines (6 features + 1 target expected)
            if len(tokenized) != 7:
                continue
            dataset.append(np.array(tokenized))
    return np.array(dataset)

# split the dataset into X - features and Y - labels / targets
# assumes last column of the data is the target
def XYfromDataset(dataset):
    X = []
    Y = []
    for d in dataset:
        X.append(np.array(d[:-1]))
        Y.append(d[-1])
    return np.array(X), np.array(Y)

def splitXY(X, Y, perc):
    splitpos = int(len(X) * perc)

    X_train = X[:splitpos]
    X_test = X[splitpos:]
    Y_train = Y[:splitpos]
    Y_test = Y[splitpos:]

    return (X_train, Y_train, X_test, Y_test)


def mapDimension(dimen, mapping):
    res = []
    for d in dimen:
        res.append(float(mapping.index(d)))
    return np.array(res)


def runTrails( dataset, split = 0.66 ):    
    random.shuffle(dataset, random.random)

    (X,Y) = XYfromDataset(dataset)
    (X_train, Y_train, X_test, Y_test) = splitXY(X, Y, split)
    mnb = naive_bayes.MultinomialNB()
    mnb.fit(X_train, Y_train)
    score = mnb.score(X_test, Y_test)
    mnb = None
    return score   





dataset = readDataSet('car.txt')
print "Class distributution:" , Counter(dataset[:,6])
for d in range(dataset.shape[1]):
    dataset[:, d] = mapDimension(dataset[: , d] , dimensions[d])
dataset = dataset.astype(float)

score = 0.0
num_trails = 10
for t in range(num_trails):
    acc = runTrails(dataset)
    print "Trail", t, "Accuracy:", acc
    score += acc

print score / num_trails

The dataset can be found at http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

I am confused by the program's output:

Trail 0 Accuracy: 0.758503401361
Trail 1 Accuracy: 0.84693877551
Trail 2 Accuracy: 0.926870748299
Trail 3 Accuracy: 0.96768707483
Trail 4 Accuracy: 0.979591836735
Trail 5 Accuracy: 0.996598639456
Trail 6 Accuracy: 1.0
Trail 7 Accuracy: 1.0
Trail 8 Accuracy: 1.0
Trail 9 Accuracy: 1.0
0.947619047619

If I remove random.shuffle() from runTrails(), this is the output:

Class distributution: Counter({'unacc': 1210, 'acc': 384, 'good': 69, 'vgood': 65})
Trail 0 Accuracy: 0.583333333333
Trail 1 Accuracy: 0.583333333333
Trail 2 Accuracy: 0.583333333333
Trail 3 Accuracy: 0.583333333333
Trail 4 Accuracy: 0.583333333333
Trail 5 Accuracy: 0.583333333333
Trail 6 Accuracy: 0.583333333333
Trail 7 Accuracy: 0.583333333333
Trail 8 Accuracy: 0.583333333333
Trail 9 Accuracy: 0.583333333333
0.583333333333

As far as I understand, shuffling affects the algorithm's accuracy on this dataset because the dataset is sorted by class.
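One way to test the effect of row ordering without touching the original array is to draw a fresh index permutation for each trial and split on the permuted rows. The following is a minimal sketch, not the code from the question: `shuffled_split` is a hypothetical helper, and the toy array only stands in for the encoded car data.

```python
import numpy as np

def shuffled_split(data, train_fraction=0.66, rng=None):
    """Return (X_train, Y_train, X_test, Y_test) from a shuffled copy of data.

    Assumes the last column holds the label. The input array is not modified.
    """
    rng = np.random.RandomState() if rng is None else rng
    order = rng.permutation(len(data))   # random ordering of the row indices
    shuffled = data[order]               # fancy indexing returns a copy
    split = int(len(shuffled) * train_fraction)
    X, Y = shuffled[:, :-1], shuffled[:, -1]
    return X[:split], Y[:split], X[split:], Y[split:]

# Toy data sorted by label, standing in for the encoded dataset (assumption).
toy = np.column_stack([np.arange(12), np.repeat([0, 1, 2], 4)]).astype(float)
X_train, Y_train, X_test, Y_test = shuffled_split(toy, rng=np.random.RandomState(0))
```

Because every call works on a copy, repeated trials are independent of each other and the source data stays intact.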

So an accuracy of about 70% for the first iteration makes sense.

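But why does the accuracy keep increasing? That makes no sense to me. If the algorithm kept training across iterations I would expect it to improve, but here I create a new instance every time and also reshuffle the dataset.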
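A possible explanation, offered as a sketch and not as a confirmed diagnosis of the run above: CPython's `random.shuffle` swaps elements with `x[i], x[j] = x[j], x[i]`. On a 2-D NumPy array, `x[i]` and `x[j]` are views, so the swap overwrites one row with the other instead of exchanging them, and distinct rows are lost. Since `runTrails` shuffles the same `dataset` array in place on every call, the loss would accumulate from trial to trial until the training and test sets consist of a few repeated rows, which are trivially easy to classify. The toy array and the helper `count_unique_rows` below are illustrative assumptions; `np.random.shuffle` is NumPy's own in-place shuffle along the first axis.

```python
import random
import numpy as np

def count_unique_rows(a):
    """Number of distinct rows in a 2-D array (hypothetical helper)."""
    return len(np.unique(a, axis=0))

random.seed(0)
np.random.seed(0)

# Toy stand-in for the encoded dataset: 100 rows, all distinct (assumption).
original = np.arange(200, dtype=float).reshape(100, 2)

# Repeated in-place shuffles with the standard-library function,
# mirroring what runTrails does to the shared array on every trial.
stdlib_shuffled = original.copy()
stdlib_unique = []
for _ in range(10):
    random.shuffle(stdlib_shuffled)
    stdlib_unique.append(count_unique_rows(stdlib_shuffled))

# The same number of in-place shuffles with NumPy's row-aware shuffle.
numpy_shuffled = original.copy()
for _ in range(10):
    np.random.shuffle(numpy_shuffled)
numpy_unique = count_unique_rows(numpy_shuffled)

print("unique rows after each random.shuffle:", stdlib_unique)
print("unique rows after ten np.random.shuffle calls:", numpy_unique)
```

If the unique-row count drops under `random.shuffle` but stays constant under `np.random.shuffle`, replacing the call in `runTrails` with `np.random.shuffle(dataset)` would be the first thing to try.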

0 answers:

No answers yet.