Partitioning a dataset into training and testing sets

Time: 2015-03-16 10:23:30

Tags: matlab validation testing training-data data-partitioning

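I have two image datasets of subjects 1-200, each subject having c (e.g. c = 8) images. I now want to split both datasets into training and testing sets for my algorithm. I generally want to do this for the following cases:

Cases required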

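  1. Case 1: Randomly select k1 images of each subject (k1 < c) for training and k2 images of each subject (k2 < c, k1 + k2 <= c) for testing. The training set then has k1*200 images and the testing set k2*200 images. The subjects overlap completely between the training and testing sets.

     Note: since the same subjects are used in both sets, the k1 and k2 images must not overlap. For example, with k1 = 3 and k2 = 3, pick any 3 images of each subject for training and any other 3 images of that subject for testing. This is why the constraint k1 + k2 <= c is necessary.

  2. Case 2: The training set consists of t randomly selected subjects and the testing set of the other 200 - t subjects, so the subjects in the two sets do not overlap at all. Randomly select k1 images (k1 < c) of each of the t training subjects, and k2 images of each of the 200 - t testing subjects. The training set then has k1*t images and the testing set k2*(200 - t) images. Note that k1 + k2 need not equal c, and k1 = k2 is possible.

     Note: since different subjects are used in the two sets, the k1 and k2 selections may overlap in position, and the constraint k1 + k2 <= c is not required.

  3. Case 3: The training and testing sets both consist of images from all subjects, i.e. the subjects overlap completely between the two sets. Randomly select m images (e.g. m = 470) from the database for the training set such that at least i images (e.g. i = 2, with i < c) of each subject are present. The training set then has m images, and the testing set contains the remaining 200*c - m images.

I want to write the code in MATLAB. Any help would be appreciated. Thanks in advance.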
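
To make the three cases concrete, the set sizes they imply can be checked with a few lines of arithmetic. This is only a sketch: the values come from the examples in the question, except `t = 120`, which is an assumed number of training subjects, and `i_min`, which is a stand-in name for the per-subject minimum i.

```matlab
% Values taken from the question, except t, which is an assumed example
nSubj = 200;  c = 8;
k1 = 3;  k2 = 3;        % per-subject image counts (Cases 1 and 2)
t  = 120;               % ASSUMED number of training subjects (Case 2)
m  = 470;  i_min = 2;   % training size and per-subject minimum (Case 3)

% Case 1: same subjects in both sets, so the two picks must be disjoint
assert(k1 + k2 <= c);
case1 = [k1*nSubj, k2*nSubj];        % [train, test] = [600, 600]

% Case 2: disjoint subjects, so k1 and k2 are only bounded by c individually
assert(k1 < c && k2 < c && t >= 1 && t < nSubj);
case2 = [k1*t, k2*(nSubj - t)];      % [train, test] = [360, 240]

% Case 3: m must cover the per-subject minimum and fit in the database
assert(i_min*nSubj <= m && m <= nSubj*c);
case3 = [m, nSubj*c - m];            % [train, test] = [470, 1130]
```

The feasibility check for Case 3 (`i_min*nSubj <= m`) is worth running before any sampling, because a smaller m cannot give every subject its minimum share.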

Edit: I tried to implement it in MATLAB. Here is the code:

        %% Read the data
        %% My data reads as follows:
        Name            Size            Bytes  Class     Attributes
        
        a_data         99x1             12672  cell                
        a_labels        1x99              792  double              
        c               1x1                 8  double              
        card_a         11x2               176  double              
        unq_a_lab       1x11               88  double             
        
        % where a_data is my total dataset. 
        % Assume that it contains total 99 images. 
        % a_labels is the labels associated with the images. 
        % c is the minimum number of images available for any subject,
        % calculated as min(card(subj1), card(subj2), ...)
        % card_a is the cardinality of each subject present in the database:
        % subjects 1, 2, 3, 4, ... have 10, 9, 11, 9, ... images,
        % i.e. card of subj 1 = 10, card of subj 2 = 9, etc.
        % unq_a_lab : the unique subject labels present in the database.
        % Assume there are 11 of them (as given).
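
For reference, the summary variables listed above could be derived from `a_labels` alone. The following is a minimal sketch on toy labels: the 12-element `a_labels` and the helper names `subj_idx` and `counts` are assumptions for illustration, not the asker's data. It builds `card_a` with one row per subject, to match the 11x2 size shown in the listing.

```matlab
% Toy labels standing in for a_labels (ASSUMPTION: 3 subjects, unequal counts)
a_labels = [1 1 1 1 2 2 2 3 3 3 3 3];

% Unique subject labels, plus an index mapping each image to its subject
[unq_a_lab, ~, subj_idx] = unique(a_labels);
unq_a_lab = unq_a_lab(:).';                  % force a row vector

% Number of images per subject (one count per unique label)
counts = accumarray(subj_idx(:), 1);

% One row per subject: [label, cardinality]
card_a = [unq_a_lab(:), counts(:)];

% Smallest cardinality over all subjects
c = min(counts);
```

With these toy labels, `c` is 3 because subject 2 has only three images.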
        

Case 1

        %% CASE 1 COMPLETELY OVERLAPPING DATASET EQUAL SIZED PARTITIONS
        % Randomly split the dataset into training and testing subsets
        % trainset - k1 images of each subject
        % testset - k2 images of each subject
        % bear in mind the constraint: k1+k2<=c
        % Total training set = k1*no. of subjects
        % Total testing set = k2*no. of subjects
        % Both training and testing sets (subjects) are completely overlapping
        
        %split 1 
        k1 = 3;
        %split 2
        k2 = 3;
        
        Train_data_a = cell(length(unq_a_lab)*k1,1);
        Test_data_a = cell(length(unq_a_lab)*k2,1);
        tr_a_labels = zeros(1,length(unq_a_lab)*k1);
        tst_a_labels = zeros(1,length(unq_a_lab)*k2);
        
        t1=0; t2=0;
        for i=1:length(unq_a_lab)
            id = randperm(c);
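            % split it into 1:k1 and k1+1:k1+k2 points
            % (the index c*(i-1)+id(j) assumes every subject has exactly c
            % images stored contiguously in a_data)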
            for j=1:k1
                Train_data_a{t1+j} = a_data{c*(i-1)+id(j)};
                tr_a_labels(1,t1+j) = a_labels(c*(i-1)+id(j));
            end
            for j=1:k2
                Test_data_a{t2+j} = a_data{c*(i-1)+id(j+k1)};
                tst_a_labels(1,t2+j) = a_labels(c*(i-1)+id(j+k1));        
            end
            t1 = t1+k1; t2 = t2+k2;
        end
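
The loop above addresses images as `c*(i-1)+id(j)`, which is only valid if every subject has exactly c images stored contiguously. The `card_a` example (10, 9, 11, 9, ...) suggests that may not hold. A possible alternative, sketched here on toy data, looks up each subject's images by label instead. The toy `a_labels` and `a_data`, and the names `members`, `train_idx` and `test_idx`, are assumptions for illustration.

```matlab
% Toy data (ASSUMPTION): 3 subjects with 4, 3 and 5 images
a_labels = [1 1 1 1 2 2 2 3 3 3 3 3];
a_data   = num2cell(1:numel(a_labels)).';   % stand-in for the image cell array
k1 = 2;  k2 = 1;                            % per-subject train/test counts

unq = unique(a_labels);
train_idx = [];
test_idx  = [];
for s = 1:numel(unq)
    % Positions of this subject's images, wherever they are stored
    members = find(a_labels == unq(s));
    assert(k1 + k2 <= numel(members), 'k1 + k2 exceeds images of a subject');
    % Shuffle within the subject, then take disjoint slices
    members   = members(randperm(numel(members)));
    train_idx = [train_idx, members(1:k1)];
    test_idx  = [test_idx,  members(k1+1:k1+k2)];
end

% Gather data and labels in one indexing step
Train_data_a = a_data(train_idx);   tr_a_labels  = a_labels(train_idx);
Test_data_a  = a_data(test_idx);    tst_a_labels = a_labels(test_idx);
```

Because both slices come from one shuffled list per subject, no image can land in both sets.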
        

Case 2(a)

        %% CASE 2 COMPLETELY NON-OVERLAPPING DATASETS EQUAL SIZED PARTITIONS
        % Randomly split the dataset into training and testing subsets
        % trainset - k1 images of each training subject
        % testset - k2 images of each testing subject
        % Total training set = k1 * number of subjects in the training set
        % Total testing set = k2 * number of subjects in the testing set
        % number of training subjects + number of testing subjects = total
        % number of subjects in the database
        % Both training and testing sets (subjects) are non-overlapping
        % p1 = number of subjects in training set
        % p2 = number of subjects in testing set
        
        %split 1 
        k1 = 3;
        %split 2
        k2 = 3;
        % size of the partitions
        % p1 = number of classes in the training sets
        % p2 = number of classes in the testing sets
        size_p = length(unq_a_lab);
        p1 = randi(size_p-1);   % random integer in 1..size_p-1, so neither set is empty
        p2 = size_p-p1;
        
        Train_data_a = cell(p1*k1,1);
        Test_data_a = cell(p2*k2,1);
        tr_a_labels = zeros(1,p1*k1);
        tst_a_labels = zeros(1,p2*k2);
        t1=0; t2=0;
        for i=1:length(unq_a_lab)
            id = randperm(c);
            % split it into 1:k1 and 1:k2 points
            if i<=p1
                for j=1:k1
                    Train_data_a{t1+j} = a_data{c*(i-1)+id(j)};
                    tr_a_labels(1,t1+j) = a_labels(c*(i-1)+id(j));            
                end
                t1 = t1+k1;
            end
        
            if i>p1
                for j=1:k2
                    Test_data_a{t2+j} = a_data{c*(i-1)+id(j)};
                    tst_a_labels(1,t2+j) = a_labels(c*(i-1)+id(j));                    
                end
                t2 = t2+k2;
            end
        end
        

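Case 2(b)

Here the subjects are randomized, so that p1 subjects are picked at random from all subjects and the rest form the p2 testing subjects.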

        %split 1
        k1 = 3;
        %split 2
        k2 = 3;
        % size of the partitions
        % p1 = number of classes in the training sets
        % p2 = number of classes in the testing sets
        size_p = length(unq_a_lab);
        p1 = randi(size_p-1);   % random integer in 1..size_p-1, so neither set is empty
        p2 = size_p-p1;
        
        Train_data_a = cell(p1*k1,1);
        Test_data_a = cell(p2*k2,1);
        tr_a_labels = zeros(1,p1*k1);
        tst_a_labels = zeros(1,p2*k2);
        x = randperm(length(unq_a_lab));
        t1=0; t2=0;
        for i=1:length(unq_a_lab)
            id = randperm(c);
            % split it into 1:k1 and 1:k2 points
            if i<=p1
                for j=1:k1
                    Train_data_a{t1+j} = a_data{c*(x(i)-1)+id(j)};
                    tr_a_labels(1,t1+j) = a_labels(c*(x(i)-1)+id(j));
                end
                t1 = t1+k1;
            end    
            if i>p1
                for j=1:k2
                    Test_data_a{t2+j} = a_data{c*(x(i)-1)+id(j)};
                    tst_a_labels(1,t2+j) = a_labels(c*(x(i)-1)+id(j));
                end
                t2 = t2+k2;
            end
        end
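
One possible way to write the subject-disjoint split of Case 2 without relying on a contiguous, equal-sized layout is sketched below. The toy labels and the names `train_subj`, `test_subj`, `train_idx` and `test_idx` are assumptions for illustration. `randi(nSubj-1)` keeps `p1` between 1 and nSubj-1, so both sets get at least one subject.

```matlab
% Toy labels (ASSUMPTION): 5 subjects with 4 images each
a_labels = repmat(1:5, 4, 1);
a_labels = a_labels(:).';
k1 = 3;  k2 = 2;                      % images per training / testing subject

unq   = unique(a_labels);
nSubj = numel(unq);
p1    = randi(nSubj - 1);             % number of training subjects, 1..nSubj-1

% Shuffle the subjects, then cut the shuffled list into two disjoint groups
perm       = randperm(nSubj);
train_subj = unq(perm(1:p1));
test_subj  = unq(perm(p1+1:end));

train_idx = [];
for s = train_subj
    members   = find(a_labels == s);
    train_idx = [train_idx, members(randperm(numel(members), k1))];
end

test_idx = [];
for s = test_subj
    members  = find(a_labels == s);
    test_idx = [test_idx, members(randperm(numel(members), k2))];
end

tr_a_labels  = a_labels(train_idx);
tst_a_labels = a_labels(test_idx);
```

The index vectors can then be applied to the image cell array, e.g. `a_data(train_idx)`, in the same way as the labels.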
        

Case 3

        %% CASE 3 COMPLETELY NON OVERLAPPING DATASETS UNEQUAL SIZED PARTITIONS
        %% Randomly split the dataset into training and testing subsets
        % trainset - m images in total, each subject having at least i=floor(m/p1) images
        % testset - k2 images of each subject
        % Total training set = m images
        % Total testing set = k2*p2 images
        % cardinality of Train Set + cardinality of Test Set = Total cardinality of
        % the database
        % Both training and testing sets (subjects) are non-overlapping
        
        % size of the partitions
        % p1 = number of classes in the training sets
        % p2 = number of classes in the testing sets
        size_p = length(unq_a_lab);
        % p1 = round((size_p-1)*rand);
        p1 = 6;
        p2 = size_p-p1;
        
        %split 1
        m = 29;
        min_reqd = floor(m/p1);
        %split 2
        k2 = 3;
        
        Train_data_a = cell(m,1);
        Test_data_a = cell(p2*k2,1);
        tr_a_labels = zeros(1,m);
        dummy_labels = tr_a_labels;
        tst_a_labels = zeros(1,p2*k2);
        x = randperm(length(unq_a_lab));
        % filling up the first min_reqd for each class
        t1=1;
        for j=1:p1
            idx = randperm(c);
            idx = idx(1:min_reqd);
            for k=1:min_reqd
                dummy_labels(t1) = c*(x(j)-1)+idx(k);
                t1 = t1+1;
            end
        end
        % form the numberset
        num_pack = zeros(1,c*p1);
        t2=1;
        for j=1:p1
            for k=1:c
                num_pack(1,t2) = c*(x(j)-1)+k;
                t2 = t2+1;
            end
        end
        % getting the indices that have not been already selected previously
        % using the set difference operation
        % setdiff(A,B) is the values of A that are not in B
        new_a_labels = setdiff(num_pack,dummy_labels);
        idx = randperm(length(new_a_labels));
        % randomly selecting the left amount of values from the set difference
        % subset
        idx = new_a_labels(idx(1:m-(min_reqd*p1)));
        % inserting the values into the matrix
        dummy_labels(t1:t1+length(idx)-1) = idx;
        % sorting the selected indices in ascending order
        dummy_labels = sort(dummy_labels);
        
        % using the indices of the dummy variables to get the training set and 
        % their corresponding labels
        for i=1:m
            Train_data_a{i} = a_data{dummy_labels(i)};
            tr_a_labels(1,i) = a_labels(dummy_labels(i));
        end
        
        % getting the testing set as previously done in case 2
        t2=0;
        for i=1:length(unq_a_lab)
            % Random selection of k2 points for the testing set
            id = randperm(c);
            if i>p1
                for j=1:k2
                    Test_data_a{t2+j} = a_data{c*(x(i)-1)+id(j)};
                    tst_a_labels(1,t2+j) = a_labels(c*(x(i)-1)+id(j));
                end
                t2 = t2+k2;
            end
        end
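
The Case 3 code above puts disjoint groups of subjects into the two sets, whereas Case 3 as described in the question keeps every subject in both sets. A minimal sketch of the described version follows, under stated assumptions: the toy labels, the example values of `m` and `min_per_subj` (standing in for i), and the names `pool`, `extra`, `train_idx` and `test_idx` are all illustrative. It first reserves the per-subject minimum, then tops up the training set at random from whatever is left.

```matlab
% Toy labels (ASSUMPTION): 5 subjects with 4 images each, 20 images in total
a_labels = repmat(1:5, 4, 1);
a_labels = a_labels(:).';
m = 13;                % total number of training images
min_per_subj = 2;      % minimum training images per subject (i in the question)

unq = unique(a_labels);
N   = numel(a_labels);
assert(min_per_subj*numel(unq) <= m && m <= N, 'm is not feasible');

% Step 1: reserve min_per_subj random images of every subject
train_idx = [];
for s = unq
    members = find(a_labels == s);
    assert(numel(members) >= min_per_subj, 'subject has too few images');
    train_idx = [train_idx, members(randperm(numel(members), min_per_subj))];
end

% Step 2: fill up to m images from everything not yet selected
pool  = setdiff(1:N, train_idx);
extra = pool(randperm(numel(pool), m - numel(train_idx)));
train_idx = sort([train_idx, extra]);

% Step 3: the testing set is every remaining image
test_idx = setdiff(1:N, train_idx);

tr_a_labels  = a_labels(train_idx);
tst_a_labels = a_labels(test_idx);
```

Note that a subject may end up with all of its images in the training set, so the testing set is not guaranteed to contain every subject unless a per-subject cap is added in step 2.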
        

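Note

I believe my Case 1 and Case 2 are correct. If anything is wrong, please point it out. I need help with Case 3: I have completed it, but I am not at all sure it is right.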

0 answers:

No answers