Naive Bayes classifier for multiple classes: getting the same error rate

Time: 2012-10-06 20:59:55

Tags: matlab machine-learning

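I have implemented a Naive Bayes classifier for multiple classes, but the problem is that my error rate stays the same when I increase the size of the training set. I have been debugging this but cannot figure out why it happens, so I thought I would post here to see what I am doing wrong.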

%Naive Bayes classifier
%This function splits the data 80:20 into training and test sets; from the 80%
%we then use increasing fractions (5, 10, 15, 20, 30 percent) as training data
%to see how the error rate changes.
%The goal is to compare the plots with those in the Stanford paper
%http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

function[tPercent] = naivebayes(file, iter, percent)
dm = load(file);
    for i=1:iter

        %Getting the index common to test and train data
        idx = randperm(size(dm.data,1));

        %Using same idx for data and labels
        shuffledMatrix_data = dm.data(idx,:);
        shuffledMatrix_label = dm.labels(idx,:);

        percent_data_80 = round((0.8) * length(shuffledMatrix_data));


        %Doing 80-20 split
        train = shuffledMatrix_data(1:percent_data_80,:);

        test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);

        %Getting the label data from the 80:20 split
        train_labels = shuffledMatrix_label(1:percent_data_80,:);

        test_labels = shuffledMatrix_label(percent_data_80+1:length(shuffledMatrix_data),:);

        %Getting the array of percents [5 10 15..]
        percent_tracker = zeros(length(percent), 2);

        for pRows = 1:length(percent)

            percentOfRows = round((percent(pRows)/100) * length(train));
            new_train = train(1:percentOfRows,:);
            new_train_label = train_labels(1:percentOfRows);

            %get unique labels in training
            numClasses = size(unique(new_train_label),1);
            classMean = zeros(numClasses,size(new_train,2));
            classStd = zeros(numClasses, size(new_train,2));
            priorClass = zeros(numClasses, size(2,1));

            % Doing the K class mean and std with prior
            for kclass=1:numClasses
                classMean(kclass,:) = mean(new_train(new_train_label == kclass,:));
                classStd(kclass, :) = std(new_train(new_train_label == kclass,:));
                priorClass(kclass, :) = length(new_train(new_train_label == kclass))/length(new_train);
            end

            error = 0;

            p = zeros(numClasses,1);

            % Calculating the posterior for each test row for each k class
            for testRow=1:length(test)
                c=0; k=0;
                for class=1:numClasses
                    temp_p = normpdf(test(testRow,:),classMean(class,:), classStd(class,:));
                    p(class, 1) = sum(log(temp_p)) + (log(priorClass(class)));
                end
                %Take the max of posterior 
                [c,k] = max(p(1,:));
                if test_labels(testRow) ~= k
                    error = error +  1;
                end
            end
            avgError = error/length(test);
            percent_tracker(pRows,:) = [avgError percent(pRows)];
            tPercent = percent_tracker;
            plot(percent_tracker)
        end
    end
end

These are the dimensions of my data:

x = 

      data: [768x8 double]
    labels: [768x1 double]

I am using the Pima dataset from UCI.

1 answer:

Answer 0 (score: 2)

What results do you get when you run your implementation on the training data itself? Does it fit perfectly?
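A minimal sketch of that check, assuming a small synthetic two-class data set rather than the Pima data. The names `X`, `y`, `mu`, `sigma`, `prior`, and `pred` are illustrative and do not come from the question's code, and the Gaussian log-density is written out by hand instead of calling `normpdf`, so no toolbox is needed. The idea is to fit on a set, predict on the same set, and look at both the error and which classes are ever predicted.

```matlab
% Synthetic, well-separated two-class data (hypothetical, not the Pima set).
X = [0.0 1.0; 0.5 1.5; 1.0 0.5; 1.5 1.2;      % class 1
     5.0 6.0; 5.5 6.5; 6.0 5.5; 6.5 6.2];     % class 2
y = [1; 1; 1; 1; 2; 2; 2; 2];

% Fit per-class mean, standard deviation, and prior.
classes = unique(y);
K = numel(classes);
d = size(X, 2);
mu = zeros(K, d);
sigma = zeros(K, d);
prior = zeros(K, 1);
for k = 1:K
    rows = (y == classes(k));
    mu(k, :)    = mean(X(rows, :), 1);
    sigma(k, :) = std(X(rows, :), 0, 1);
    prior(k)    = sum(rows) / numel(y);
end

% Predict on the training rows themselves.
pred = zeros(size(y));
for i = 1:size(X, 1)
    logPost = zeros(K, 1);
    for k = 1:K
        z = (X(i, :) - mu(k, :)) ./ sigma(k, :);
        logPdf = -0.5 * z.^2 - log(sigma(k, :)) - 0.5 * log(2 * pi);
        logPost(k) = sum(logPdf) + log(prior(k));
    end
    [~, best] = max(logPost);     % argmax over all K log-posteriors
    pred(i) = classes(best);
end

trainError = mean(pred ~= y);
fprintf('training error: %.3f\n', trainError);
disp(unique(pred)');              % which classes are ever predicted
```

On data this cleanly separated, the training error should be close to zero and both classes should appear among the predictions. If only one class ever shows up, the prediction step is worth a closer look before the model itself.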

It is hard to say for sure, but here are a few things I noticed:

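  1. Every class needs a meaningful amount of training data. You cannot really train a classifier to recognize a class if there are no training examples for it.
  2. If possible, the number of training examples should not be skewed toward some classes. For example, if in a 2-class problem the class 1 examples make up only 5% of the training and cross-validation data, then a function that always returns class 2 will have a 5% error. Have you tried checking precision and recall separately?
  3. You are fitting a normal distribution to each feature within a class and then using it for the posterior. I am not sure how that plays out in terms of smoothing. Could you try reimplementing it with simple counts and see whether that gives different results?
  4. It could also be that the features are highly redundant, so that the Bayes approach overcounts the probabilities.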
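The point about skewed classes and precision/recall can be sketched as follows. The labels are made up (5 examples of class 1, 95 of class 2), and `yTrue`, `yPred`, `precision`, and `recall` are illustrative names. The last few lines look at the prediction step in the question: the column shape of `p` and the `max(p(1,:))` call are taken from the posted code, while the numbers in `p` are invented.

```matlab
% Hypothetical labels: 5 examples of class 1, 95 of class 2.
yTrue = [ones(5, 1); 2 * ones(95, 1)];
yPred = 2 * ones(100, 1);               % a "classifier" that always answers 2

errRate = mean(yPred ~= yTrue);         % overall error looks small

% Per-class precision and recall expose the constant predictor.
classes = unique(yTrue);
precision = zeros(numel(classes), 1);
recall = zeros(numel(classes), 1);
for k = 1:numel(classes)
    c  = classes(k);
    tp = sum(yPred == c & yTrue == c);
    fp = sum(yPred == c & yTrue ~= c);
    fn = sum(yPred ~= c & yTrue == c);
    precision(k) = tp / max(tp + fp, 1);    % max(..., 1) avoids 0/0
    recall(k)    = tp / max(tp + fn, 1);
end
fprintf('error %.2f, recall per class: %s\n', errRate, mat2str(recall'));

% Prediction step: p is a numClasses-by-1 column, as in the question.
p = [-12.3; -4.5];                      % made-up log-posteriors
[~, kFirst] = max(p(1, :));             % p(1,:) is one element, so the index is 1
[~, kAll]   = max(p);                   % looks at every class, index is 2 here
```

A recall of 0 for one class alongside a low overall error is the signature of a predictor that always returns the same class. Because `p(1,:)` selects a single element, `max(p(1,:))` returns index 1 whatever the posteriors are, so the posted loop may be behaving like such a constant predictor; that would give the same error rate for every training size. Taking `max(p)` over the whole column is presumably what was intended. It may also be worth confirming that the label values really are 1..numClasses, since the loop compares labels against the loop index.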