I am trying to measure the accuracy of the Multinomial Naive Bayes algorithm in scikit-learn.
Here is the code:
import numpy as np
import random
from sklearn import naive_bayes
from sklearn.preprocessing import LabelBinarizer
from collections import Counter

dim0 = ['high', 'low', 'med', 'vhigh']
dim1 = ['high', 'low', 'med', 'vhigh']
dim2 = ['2', '3', '4', '5more']
dim3 = ['2', '4', 'more']
dim4 = ['big', 'med', 'small']
dim5 = ['high', 'low', 'med']
target = ['acc', 'good', 'unacc', 'vgood']
dimensions = [dim0, dim1, dim2, dim3, dim4, dim5, target]

# function to read dataset
def readDataSet(fname):
    f = open(fname, 'r')
    dataset = []
    for line in f:
        words = []
        tokenized = line.strip().split(',')
        if len(tokenized) != 7:
            continue
        for w in tokenized:
            words.append(w)
        dataset.append(np.array(words))
    return np.array(dataset)

# split the dataset into X - features and Y - labels / targets
# assumes last column of the data is the target
def XYfromDataset(dataset):
    X = []
    Y = []
    for d in dataset:
        X.append(np.array(d[:-1]))
        Y.append(d[-1])
    return np.array(X), np.array(Y)

def splitXY(X, Y, perc):
    splitpos = int(len(X) * perc)
    X_train = X[:splitpos]
    X_test = X[splitpos:]
    Y_train = Y[:splitpos]
    Y_test = Y[splitpos:]
    return (X_train, Y_train, X_test, Y_test)

def mapDimension(dimen, mapping):
    res = []
    for d in dimen:
        res.append(float(mapping.index(d)))
    return np.array(res)

def runTrails(dataset, split=0.66):
    random.shuffle(dataset, random.random)
    (X, Y) = XYfromDataset(dataset)
    (X_train, Y_train, X_test, Y_test) = splitXY(X, Y, split)
    mnb = naive_bayes.MultinomialNB()
    mnb.fit(X_train, Y_train)
    score = mnb.score(X_test, Y_test)
    mnb = None
    return score

dataset = readDataSet('car.txt')
print "Class distributution:", Counter(dataset[:, 6])
for d in range(dataset.shape[1]):
    dataset[:, d] = mapDimension(dataset[:, d], dimensions[d])
dataset = dataset.astype(float)

score = 0.0
num_trails = 10
for t in range(num_trails):
    acc = runTrails(dataset)
    print "Trail", t, "Accuracy:", acc
    score += acc
print score / num_trails
The dataset is available at http://archive.ics.uci.edu/ml/datasets/Car+Evaluation
I am confused by the program's output:
Trail 0 Accuracy: 0.758503401361
Trail 1 Accuracy: 0.84693877551
Trail 2 Accuracy: 0.926870748299
Trail 3 Accuracy: 0.96768707483
Trail 4 Accuracy: 0.979591836735
Trail 5 Accuracy: 0.996598639456
Trail 6 Accuracy: 1.0
Trail 7 Accuracy: 1.0
Trail 8 Accuracy: 1.0
Trail 9 Accuracy: 1.0
0.947619047619
If I remove the random.shuffle() call from runTrails(), this is the output:
Class distributution: Counter({'unacc': 1210, 'acc': 384, 'good': 69, 'vgood': 65})
Trail 0 Accuracy: 0.583333333333
Trail 1 Accuracy: 0.583333333333
Trail 2 Accuracy: 0.583333333333
Trail 3 Accuracy: 0.583333333333
Trail 4 Accuracy: 0.583333333333
Trail 5 Accuracy: 0.583333333333
Trail 6 Accuracy: 0.583333333333
Trail 7 Accuracy: 0.583333333333
Trail 8 Accuracy: 0.583333333333
Trail 9 Accuracy: 0.583333333333
0.583333333333
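As far as I understand, shuffling affects the algorithm's accuracy on this dataset because the dataset is sorted by class.
So an accuracy in the 70% range on the first iteration (0.7585) seems plausible.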
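To make the ordering argument concrete, here is a minimal sketch of what a head/tail split does to labels that are strictly sorted by class. The class counts come from the "Class distributution" line above, but the strict ordering is an assumption made for illustration (I have not verified that car.txt is ordered this way), and `split_labels` is a hypothetical helper that only mirrors the arithmetic of splitXY. It is written for Python 3 and uses plain lists, not the NumPy array from my program.

```python
import random
from collections import Counter

# Synthetic labels sorted by class. The counts match the class distribution
# printed by the program; the strict ordering is an assumption.
labels = ["unacc"] * 1210 + ["acc"] * 384 + ["good"] * 69 + ["vgood"] * 65


def split_labels(seq, perc=0.66):
    """Hypothetical helper: head/tail split, same arithmetic as splitXY."""
    pos = int(len(seq) * perc)
    return seq[:pos], seq[pos:]


# Without shuffling, the training part only ever sees the first class.
train, test = split_labels(labels)
print("sorted   train:", Counter(train))
print("sorted   test: ", Counter(test))

# After shuffling a plain Python list, both parts contain every class.
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
train_s, test_s = split_labels(shuffled)
print("shuffled train:", Counter(train_s))
print("shuffled test: ", Counter(test_s))
```

Under this assumption the unshuffled training part contains a single class, so a classifier fitted on it has no way to predict the others. The real file need not be ordered this strictly, so the sketch illustrates the mechanism and is not meant to reproduce the 0.583 figure.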
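But why does the accuracy keep increasing? That makes no sense to me. If the same classifier kept training, I would expect it to improve, but here I create a new instance for every trial and shuffle the dataset again each time.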
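One sanity check that might narrow this down is to count the distinct rows after each shuffle, since the same array is shuffled in place and reused across trials. The sketch below does this on synthetic data, not on the car dataset. `count_unique_rows` is a hypothetical helper, the array shape is arbitrary, and the code is written for Python 3, so `random.shuffle` is called with a single argument. The index-permutation variant is included only for comparison and is not something my program does.

```python
import random
import numpy as np


def count_unique_rows(arr):
    """Hypothetical helper: number of distinct rows in a 2-D array."""
    return len({tuple(row) for row in arr})


# Synthetic stand-in for the dataset: 1000 rows, all distinct.
data = np.arange(2000, dtype=float).reshape(1000, 2)

# Shuffle the 2-D array in place with random.shuffle, as runTrails does,
# and record how many distinct rows remain after each pass.
random.seed(0)
inplace = data.copy()
counts = [count_unique_rows(inplace)]
for _ in range(5):
    random.shuffle(inplace)
    counts.append(count_unique_rows(inplace))
print("distinct rows after each random.shuffle pass:", counts)

# For comparison: reorder the rows through a permutation of row indices.
rng = np.random.RandomState(0)
permuted = data[rng.permutation(len(data))]
print("distinct rows after index permutation:", count_unique_rows(permuted))
```

If the count of distinct rows falls from pass to pass, later trials would be training and testing on more and more copies of the same rows, which might explain accuracy drifting toward 1.0 even though a fresh classifier is created each time.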