Naive Bayes classifier gives two different answers for supposedly identical sample sizes?

Time: 2018-12-05 23:45:51

Tags: python classification nltk sentiment-analysis naivebayes

I'm not sure what the difference is between these two:

import nltk
from nltk.classify import NaiveBayesClassifier
import pandas as pd

# sample_sizes is used below when building the DataFrame
sample_sizes = [500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400]

classifier = NaiveBayesClassifier.train(train_d)

d1 = (nltk.classify.accuracy(classifier,train_d[:500]))*100
d2 = (nltk.classify.accuracy(classifier,train_d[:600]))*100
d3 = (nltk.classify.accuracy(classifier,train_d[:700]))*100
d4 = (nltk.classify.accuracy(classifier,train_d[:800]))*100
d5 = (nltk.classify.accuracy(classifier,train_d[:900]))*100
d6 = (nltk.classify.accuracy(classifier,train_d[:1000]))*100
d7 = (nltk.classify.accuracy(classifier,train_d[:1100]))*100
d8 = (nltk.classify.accuracy(classifier,train_d[:1200]))*100
d9 = (nltk.classify.accuracy(classifier,train_d[:1300]))*100
d10 = (nltk.classify.accuracy(classifier,train_d[:1400]))*100
dvd_results = [d1,d2,d3,d4,d5,d6,d7,d8,d9,d10]

df1 = pd.DataFrame(list(zip(sample_sizes,dvd_results)),columns=["Sample Size","Accuracy"])
display(df1)
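As an aside, the ten near-identical `d1`–`d10` lines can be collapsed into a single loop over `sample_sizes`. The sketch below is self-contained and hypothetical: `ToyClassifier`, `toy_accuracy`, and `toy_train` are stdlib stand-ins for NLTK's `NaiveBayesClassifier`, `nltk.classify.accuracy`, and the question's `train_d`, which aren't shown in the post.

```python
from collections import Counter

class ToyClassifier:
    """Hypothetical stand-in for NLTK's NaiveBayesClassifier:
    always predicts the most frequent training label."""
    def __init__(self, label):
        self.label = label

    @classmethod
    def train(cls, data):
        return cls(Counter(label for _, label in data).most_common(1)[0][0])

    def classify(self, features):
        return self.label

def toy_accuracy(classifier, gold):
    """Same contract as nltk.classify.accuracy: the fraction of
    (features, label) pairs the classifier labels correctly."""
    return sum(classifier.classify(f) == label for f, label in gold) / len(gold)

# Hypothetical data standing in for train_d: 1400 (feature-dict, label) pairs.
toy_train = [({"id": i}, "pos" if i < 700 else "neg") for i in range(1400)]

sample_sizes = list(range(500, 1500, 100))  # 500, 600, ..., 1400

# The first block as one loop: train once, score on growing slices of the data.
classifier = ToyClassifier.train(toy_train)
dvd_results = [toy_accuracy(classifier, toy_train[:n]) * 100 for n in sample_sizes]
```

With the real classifier, only the two stand-in names change; `zip(sample_sizes, dvd_results)` then feeds `pd.DataFrame` exactly as before.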

which gives me these results:

 Sample Size    Accuracy
0   500     99.400000
1   600     99.500000
2   700     99.285714
3   800     99.000000
4   900     99.111111
5   1000    99.100000
6   1100    99.181818
7   1200    99.250000
8   1300    99.153846
9   1400    99.071429

versus this, which is what I originally thought it should be:

classifier_d1 = NaiveBayesClassifier.train(train_d[:500])
classifier_d2 = NaiveBayesClassifier.train(train_d[:600])
classifier_d3 = NaiveBayesClassifier.train(train_d[:700])
classifier_d4 = NaiveBayesClassifier.train(train_d[:800])
classifier_d5 = NaiveBayesClassifier.train(train_d[:900])
classifier_d6 = NaiveBayesClassifier.train(train_d[:1000])
classifier_d7 = NaiveBayesClassifier.train(train_d[:1100])
classifier_d8 = NaiveBayesClassifier.train(train_d[:1200])
classifier_d9 = NaiveBayesClassifier.train(train_d[:1300])
classifier_d10 = NaiveBayesClassifier.train(train_d[:1400])
d1 = (nltk.classify.accuracy(classifier_d1,train_d))*100
d2 = (nltk.classify.accuracy(classifier_d2,train_d))*100
d3 = (nltk.classify.accuracy(classifier_d3,train_d))*100
d4 = (nltk.classify.accuracy(classifier_d4,train_d))*100
d5 = (nltk.classify.accuracy(classifier_d5,train_d))*100
d6 = (nltk.classify.accuracy(classifier_d6,train_d))*100
d7 = (nltk.classify.accuracy(classifier_d7,train_d))*100
d8 = (nltk.classify.accuracy(classifier_d8,train_d))*100
d9 = (nltk.classify.accuracy(classifier_d9,train_d))*100
d10 = (nltk.classify.accuracy(classifier_d10,train_d))*100
dvd_results = [d1,d2,d3,d4,d5,d6,d7,d8,d9,d10]

which gives me these results:

 Sample Size    Accuracy
0   500     50.000000
1   600     50.000000
2   700     50.000000
3   800     60.142857
4   900     88.000000
5   1000    93.500000
6   1100    93.785714
7   1200    96.428571
8   1300    97.428571
9   1400    99.071429

Honestly, I can't see the difference between these two code blocks: in both cases the classifier has been trained and accuracy is then measured, yet the numbers come out completely different, which is confusing. Also, could someone explain why, for sample sizes of 700 and below, my accuracy is exactly 50%? Partly because of that, I'm going to assume the first block is the correct approach and that in the second block I've somehow broken the classifier. Alas, I don't know why!

0 Answers:

No answers yet