Python NLTK Naive Bayes for semi-supervised learning (Expectation Maximization)

Date: 2019-02-17 23:41:01

Tags: python naivebayes

I am trying to implement a semi-supervised learning algorithm based on Expectation Maximization with Naive Bayes, as described here:

Semi-supervised Naive Bayes with NLTK

I am doing text classification, where the dataset consists of a set of reviews and an associated label, for example:

Label Review
1     connect gps app connect gps matter long  gps set high accuracy  
1     wish would interest google provide weekly monthly summary 
1     useless talk gps phone  20 minute run data 
0     great app glad used track perfectly
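The question never shows how `featuresets` is built from these reviews; a minimal sketch of one common NLTK convention (bag-of-words presence dicts over whitespace tokens), where `review_to_features` is a hypothetical helper name:

```python
def review_to_features(review):
    """Map a preprocessed review string to a dict of word-presence features,
    the (features, label) shape that nltk.NaiveBayesClassifier.train expects."""
    return {word: True for word in review.split()}

# Two rows from the table above, as (label, review) pairs
data = [
    ("1", "connect gps app connect gps matter long gps set high accuracy"),
    ("0", "great app glad used track perfectly"),
]

featuresets = [(review_to_features(review), label) for label, review in data]
```

This is only an assumption about the feature representation; any dict-valued features work with NLTK's Naive Bayes.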

I am using the NLTK library and have a basic Naive Bayes classifier running 10-fold cross-validation over the dataset, with the following code:

import random
import nltk

naive_bayes.cross_validation(featuresets, 10)

def cross_validation(all_data, n_sets):
    fold_size = int(len(all_data) / n_sets)
    shuffled_data = all_data.copy()
    random.shuffle(shuffled_data)
    cumulative_percent = 0
    for i in range(0, n_sets):
        split_start = i * fold_size
        split_end = (i + 1) * fold_size
        print("train split_start: " + str(split_start) + " - split_end: " + str(split_end))
        # Train on everything outside the current fold, test on the fold itself
        train_data = shuffled_data[:split_start] + shuffled_data[split_end:]
        test_data = shuffled_data[split_start:split_end]
        classifier = nltk.NaiveBayesClassifier.train(train_data, nltk.LaplaceProbDist)
        correct = 0
        for features, label in test_data:
            if classifier.classify(features) == label:
                correct += 1
        print(str(correct) + "/" + str(len(test_data)))
        correct_percent = 100 * correct / len(test_data)
        cumulative_percent += correct_percent
        print(str(correct_percent) + "%")
    print("Average result: " + str(cumulative_percent / n_sets) + "%")

This gives a prediction accuracy of about 85%.
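The hand-written counting loop inside each fold can also be delegated to NLTK's built-in accuracy helper; a sketch on toy featuresets (the feature dicts below are hypothetical examples, not the real review data):

```python
import nltk

# Toy stand-ins for the review featuresets
train_data = [({"gps": True, "useless": True}, "1"),
              ({"great": True, "perfectly": True}, "0")]
test_data = [({"useless": True}, "1"),
             ({"great": True}, "0")]

classifier = nltk.NaiveBayesClassifier.train(train_data, nltk.LaplaceProbDist)

# nltk.classify.accuracy returns a fraction in [0, 1]
acc = nltk.classify.accuracy(classifier, test_data)
```

Multiplying `acc` by 100 gives the per-fold percentage printed in the loop above.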

However, I can't work out the semi-supervised aspect. I tried the approach below, but it lowers the accuracy; in fact, it drops below 70%:

### BEGIN EM Algorithm - Naive Bayes

n_training = 3000
labeled_data = featuresets[:n_training]
unlabeled_data = featuresets[n_training:]

classifier = nltk.NaiveBayesClassifier.train(labeled_data, nltk.LaplaceProbDist)

max_iterations = 100
for iteration in range(0, max_iterations):
    print("Iteration: " + str(iteration))
    found_labeled_data = []
    correct = 0  # For evaluation, not part of the algorithm
    for features, label in unlabeled_data:
        classified = classifier.classify(features)
        if classified == label:  # For evaluation, not part of the algorithm
            correct += 1
        found_labeled_data.append((features, classified))
    print(str(correct) + "/" + str(len(unlabeled_data)))
    correct_percent = 100 * correct / len(unlabeled_data)
    print(str(correct_percent) + "%")
    # Retrain on the labeled data plus the newly pseudo-labeled data
    classifier = nltk.NaiveBayesClassifier.train(labeled_data + found_labeled_data, nltk.LaplaceProbDist)
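For comparison, the E-step above uses hard labels from `classify()`. One commonly discussed variant (an assumption here, not something the linked post or the code above mandates) reads the posterior from `prob_classify()` and only keeps pseudo-labels above a confidence threshold; a sketch on toy data:

```python
import nltk

# Hypothetical toy data standing in for the labeled/unlabeled review splits
labeled_data = [({"gps": True, "useless": True}, "1"),
                ({"great": True, "perfectly": True}, "0")]
unlabeled_feats = [{"useless": True}, {"great": True}]

classifier = nltk.NaiveBayesClassifier.train(labeled_data, nltk.LaplaceProbDist)

threshold = 0.6  # hypothetical confidence cut-off
pseudo_labeled = []
for features in unlabeled_feats:
    dist = classifier.prob_classify(features)  # posterior distribution over labels
    best = dist.max()
    if dist.prob(best) >= threshold:           # keep only confident pseudo-labels
        pseudo_labeled.append((features, best))

# Retrain on labeled data plus the confident pseudo-labeled examples
classifier = nltk.NaiveBayesClassifier.train(labeled_data + pseudo_labeled,
                                             nltk.LaplaceProbDist)
```

Whether such filtering helps depends on the data; it is shown only to illustrate the soft E-step that `prob_classify` makes available.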

I'm not sure what I am doing wrong here. Can anyone help?

0 Answers:

There are no answers yet.