I am trying to implement a semi-supervised learning algorithm based on Expectation-Maximization with Naive Bayes, as described here:
Semi-supervised Naive Bayes with NLTK
I am doing text classification, where the dataset consists of a set of reviews and an associated label, e.g.:
Label Review
1 connect gps app connect gps matter long gps set high accuracy
1 wish would interest google provide weekly monthly summary
1 useless talk gps phone 20 minute run data
0 great app glad used track perfectly
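(For context: the code below references a `featuresets` variable that is never shown being built. A minimal sketch of how it might be constructed from rows like the above, assuming the usual NLTK bag-of-words convention of `(feature_dict, label)` pairs; the `rows` data and `review_features` helper are illustrative, not from the original post:)

```python
# Hypothetical reconstruction of the `featuresets` input used below.
# Each review becomes a bag-of-words feature dict paired with its label,
# which is the (features, label) format nltk.NaiveBayesClassifier.train expects.
rows = [
    ("1", "connect gps app connect gps matter long gps set high accuracy"),
    ("1", "wish would interest google provide weekly monthly summary"),
    ("1", "useless talk gps phone 20 minute run data"),
    ("0", "great app glad used track perfectly"),
]

def review_features(review):
    # Mark each token as present; NLTK treats absent keys as "feature not seen"
    return {token: True for token in review.split()}

featuresets = [(review_features(review), label) for label, review in rows]
```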
I am using the NLTK library and have a basic Naive Bayes classifier that runs ten-fold cross-validation over the dataset, as follows:
naive_bayes.cross_validation(featuresets, 10)

def cross_validation(all_data, n_sets):
    set_size = 1.0 / n_sets
    shuffled_data = all_data.copy()
    random.shuffle(shuffled_data)
    cumulative_percent = 0
    for i in range(n_sets):
        fold_size = int(set_size * len(all_data))
        split_start = i * fold_size
        split_end = (i + 1) * fold_size
        print("train split_start: " + str(split_start) + " - split_end: " + str(split_end))
        # Training data is everything outside the current fold; the fold itself is the test set
        train_data = shuffled_data[:split_start] + shuffled_data[split_end:]
        test_data = shuffled_data[split_start:split_end]
        classifier = nltk.NaiveBayesClassifier.train(train_data, nltk.LaplaceProbDist)
        correct = 0
        for features, label in test_data:
            if classifier.classify(features) == label:
                correct += 1
        print(str(correct) + "/" + str(len(test_data)))
        correct_percent = 100 * correct / len(test_data)
        cumulative_percent += correct_percent
        print(str(correct_percent) + "%")
    print("Average result: " + str(cumulative_percent / n_sets) + "%")
This predicts with an accuracy of about 85%.
However, I can't work out the semi-supervised part. I tried the following, but it lowers the accuracy; in fact, it drops below 70%:
### BEGIN EM Algorithm - Naive Bayes
n_training = 3000
labeled_data = featuresets[:n_training]
unlabeled_data = featuresets[n_training:]
classifier = nltk.NaiveBayesClassifier.train(labeled_data, nltk.LaplaceProbDist)
max_iterations = 100
for iteration in range(max_iterations):
    print("Iteration: " + str(iteration))
    found_labeled_data = []
    correct = 0  # For evaluation only, not part of the algorithm
    for features, label in unlabeled_data:
        classified = classifier.classify(features)
        if classified == label:  # For evaluation only, not part of the algorithm
            correct += 1
        found_labeled_data.append((features, classified))
    print(str(correct) + "/" + str(len(unlabeled_data)))
    correct_percent = 100 * correct / len(unlabeled_data)
    print(str(correct_percent) + "%")
    # Retrain on the true labels plus the labels just predicted for the unlabeled set
    classifier = nltk.NaiveBayesClassifier.train(labeled_data + found_labeled_data, nltk.LaplaceProbDist)
I'm not sure what I'm doing wrong here; can anyone help?