Question

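I have two image datasets, each with subjects 1-200, and each subject has c (e.g. c=8) images. I now want to split these datasets into training and testing sets for my algorithm. I generally want to do this in the following cases: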

Required cases

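Case 1: Randomly select k1 images of each subject (k1<c) for training and k2 images of each subject (k2<c and k1+k2<=c) for testing. The training set then has k1*200 images and the testing set has k2*200 images. Keep in mind that k1+k2<=c. The subjects in the training and testing sets overlap completely.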

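Note: since the same subjects are used in the training and testing sets, the k1 and k2 images must not overlap. For example, with k1=3 and k2=3, choose any 3 images of each subject for training and any other 3 images of the same subject for testing. The constraint k1+k2<=c is therefore necessary.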

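Case 2: The training set consists of t randomly selected subjects and the testing set consists of the other 200-t subjects, so the subjects in the training and testing sets do not overlap at all. Randomly select k1 images (k1<c) from each of the t training subjects and k2 images from each of the 200-t testing subjects. The training set then has k1*t images and the testing set has k2*(200-t) images. Note that k1+k2 need not equal c, and k1=k2 is possible.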

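Note: since different subjects are used in the training and testing sets, the image selections for k1 and k2 may overlap, and the constraint k1+k2<=c is not required.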

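Case 3: The training and testing sets both consist of images from all subjects, i.e. the subjects in the two sets overlap completely. Randomly select m (e.g. m=470) images from the database for the training set such that at least i (e.g. i=2) images of each subject are present (i<c). The training set then has m images, and the testing set contains the remaining 200*c-m images.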

I want to write the code in MATLAB. Any help would be appreciated. Thanks in advance.

Edit: I have tried to implement this in MATLAB. The code is given below:

%% Read the data
%% My data reads as follows:
Name            Size            Bytes  Class     Attributes

a_data         99x1             12672  cell                
a_labels        1x99              792  double              
c               1x1                 8  double              
card_a         11x2               176  double              
unq_a_lab       1x11               88  double             

% where a_data is my total dataset.
% Assume that it contains 99 images in total.
% a_labels holds the label (subject ID) associated with each image.
% c is the minimum number of images available for any subject,
% calculated as c = min(card(subj1), card(subj2), ...)
% card_a is the cardinality of each class present in the database,
% one row per subject: card_a = [1,10; 2,9; 3,11; 4,9; ...]
% i.e. card of subj 1 = 10, card of subj 2 = 9, etc.
% unq_a_lab: the unique subject labels present in the database.
% Assume there are 11 of them (as given).
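
As a minimal sketch, the bookkeeping variables described in the comments above could be derived from the labels alone. The toy `a_labels` vector is a made-up stand-in for the real data, and `grp` and `counts` are names introduced here purely for illustration. `card_a` is built with one row per subject so that it matches the 11x2 size shown in the listing.

```matlab
% Hypothetical toy labels: 3 subjects with 3, 2 and 4 images
a_labels = [1 1 1 2 2 3 3 3 3];

% Unique subject labels, plus a group index for every image
[unq_a_lab, ~, grp] = unique(a_labels);

% Count how many images each subject has
counts = accumarray(grp(:), 1);

% One row per subject: [subject label, number of images]
card_a = [unq_a_lab(:), counts];

% Smallest number of images available for any subject
c = min(counts);
```

With real data only the first line changes; everything else follows from `a_labels`.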

Case 1

%% CASE 1 COMPLETELY OVERLAPPING DATASET EQUAL SIZED PARTITIONS
% Split the dataset randomly into training and testing subsets
% trainset - each subject k1 images
% testset - each subject k2 images
% bear in mind constraint : k1+k2<=c
% Total training set = k1*no. of subjects
% Total testing set = k2*no. of subjects
% Both training and testing sets (subjects) are completely overlapping

%split 1 
k1 = 3;
%split 2
k2 = 3;

Train_data_a = cell(length(unq_a_lab)*k1,1);
Test_data_a = cell(length(unq_a_lab)*k2,1);
tr_a_labels = zeros(1,length(unq_a_lab)*k1);
tst_a_labels = zeros(1,length(unq_a_lab)*k2);

t1=0; t2=0;
for i=1:length(unq_a_lab)
    id = randperm(c);
    % split it into 1:k1 and k1+1:k1+k2 points
    % (the indexing c*(i-1)+id(j) assumes c consecutive images per subject)
    for j=1:k1
        Train_data_a{t1+j} = a_data{c*(i-1)+id(j)};
        tr_a_labels(1,t1+j) = a_labels(c*(i-1)+id(j));
    end
    for j=1:k2
        Test_data_a{t2+j} = a_data{c*(i-1)+id(j+k1)};
        tst_a_labels(1,t2+j) = a_labels(c*(i-1)+id(j+k1));        
    end
    t1 = t1+k1; t2 = t2+k2;
end
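
The loop above addresses images as `c*(i-1)+id(j)`, which is only valid if every subject has exactly c images stored back to back. Since `card_a` suggests the counts can differ (10, 9, 11, ...), one possible alternative is to look up each subject's images through `a_labels`. The following is a sketch under that assumption, not a verified replacement; the toy data and the names `pool`, `tr_idx` and `tst_idx` are hypothetical.

```matlab
% Hypothetical toy stand-ins for the variables listed earlier
a_labels  = [1 1 1 1 2 2 2 3 3 3 3 3];      % subject ID of every image
a_data    = num2cell(1:numel(a_labels)).';  % placeholder "images"
unq_a_lab = unique(a_labels);
k1 = 1; k2 = 2;                             % must satisfy k1 + k2 <= c (c = 3 here)

tr_idx = []; tst_idx = [];
for s = 1:numel(unq_a_lab)
    pool = find(a_labels == unq_a_lab(s));  % all images of this subject
    pool = pool(randperm(numel(pool)));     % shuffle within the subject
    tr_idx  = [tr_idx,  pool(1:k1)];        % first k1 go to training
    tst_idx = [tst_idx, pool(k1+1:k1+k2)];  % next k2 go to testing
end

Train_data_a = a_data(tr_idx);   tr_a_labels  = a_labels(tr_idx);
Test_data_a  = a_data(tst_idx);  tst_a_labels = a_labels(tst_idx);
```

Because both picks come from one shuffled `pool`, the training and testing images of a subject cannot coincide as long as k1+k2 does not exceed that subject's image count.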

Case 2(a)

%% CASE 2 COMPLETELY NON-OVERLAPPING DATASETS EQUAL SIZED PARTITIONS
% Split the dataset randomly into training and testing subsets
% trainset - each subject k1 images
% testset - each subject k2 images
% Total training set = k1* cardinality of Train Set
% Total testing set = k2* cardinality of Test Set
% cardinality of Train Set + cardinality of Test Set = Total cardinality of
% the database
% Both training and testing sets (subjects) are non-overlapping
% p1 = number of subjects in training set
% p2 = number of subjects in testing set

%split 1 
k1 = 3;
%split 2
k2 = 3;
% size of the partitions
% p1 = number of classes in the training sets
% p2 = number of classes in the testing sets
size_p = length(unq_a_lab);
p1 = randi(size_p-1);   % 1..size_p-1, so neither set ends up with zero subjects
p2 = size_p-p1;

Train_data_a = cell(p1*k1,1);
Test_data_a = cell(p2*k2,1);
tr_a_labels = zeros(1,p1*k1);
tst_a_labels = zeros(1,p2*k2);
t1=0; t2=0;
for i=1:length(unq_a_lab)
    id = randperm(c);
    % split it into 1:k1 and 1:k2 points
    if i<=p1
        for j=1:k1
            Train_data_a{t1+j} = a_data{c*(i-1)+id(j)};
            tr_a_labels(1,t1+j) = a_labels(c*(i-1)+id(j));            
        end
        t1 = t1+k1;
    end

    if i>p1
        for j=1:k2
            Test_data_a{t2+j} = a_data{c*(i-1)+id(j)};
            tst_a_labels(1,t2+j) = a_labels(c*(i-1)+id(j));                    
        end
        t2 = t2+k2;
    end
end

Case 2(b)

Here the subjects are randomized, so that p1 subjects are chosen at random from all subjects and the rest form the p2 subjects.

%split 1
k1 = 3;
%split 2
k2 = 3;
% size of the partitions
% p1 = number of classes in the training sets
% p2 = number of classes in the testing sets
size_p = length(unq_a_lab);
p1 = randi(size_p-1);   % 1..size_p-1, so neither set ends up with zero subjects
p2 = size_p-p1;

Train_data_a = cell(p1*k1,1);
Test_data_a = cell(p2*k2,1);
tr_a_labels = zeros(1,p1*k1);
tst_a_labels = zeros(1,p2*k2);
x = randperm(length(unq_a_lab));
t1=0; t2=0;
for i=1:length(unq_a_lab)
    id = randperm(c);
    % split it into 1:k1 and 1:k2 points
    if i<=p1
        for j=1:k1
            Train_data_a{t1+j} = a_data{c*(x(i)-1)+id(j)};
            tr_a_labels(1,t1+j) = a_labels(c*(x(i)-1)+id(j));
        end
        t1 = t1+k1;
    end    
    if i>p1
        for j=1:k2
            Test_data_a{t2+j} = a_data{c*(x(i)-1)+id(j)};
            tst_a_labels(1,t2+j) = a_labels(c*(x(i)-1)+id(j));
        end
        t2 = t2+k2;
    end
end
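
A quick way to gain confidence in Case 2 is to confirm that no subject label appears in both sets. The sketch below repeats the subject-disjoint split on made-up data, selecting images through `a_labels` rather than the `c*(x(i)-1)` offset, and ends with that check. The toy labels and the names `perm`, `pool`, `tr_idx` and `tst_idx` are assumptions for illustration only.

```matlab
% Hypothetical toy data: 4 subjects with 3 images each
a_labels  = [1 1 1 2 2 2 3 3 3 4 4 4];
unq_a_lab = unique(a_labels);
n  = numel(unq_a_lab);
k1 = 2; k2 = 2;

p1   = randi(n-1);                 % number of training subjects, 1..n-1
perm = unq_a_lab(randperm(n));     % subjects in random order

tr_idx = []; tst_idx = [];
for s = 1:n
    pool = find(a_labels == perm(s));     % images of this subject
    pool = pool(randperm(numel(pool)));   % shuffle within the subject
    if s <= p1
        tr_idx  = [tr_idx,  pool(1:k1)];  % training subject
    else
        tst_idx = [tst_idx, pool(1:k2)];  % testing subject
    end
end

% Sanity check: the two sets share no subject
shared = intersect(a_labels(tr_idx), a_labels(tst_idx));
```

`shared` should come back empty on every run; a non-empty result would mean a subject leaked across the split.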

Case 3

%% CASE 3 COMPLETELY NON OVERLAPPING DATASETS UNEQUAL SIZED PARTITIONS
%% Split the dataset randomly into training and testing subsets
% trainset - m images in total, each training subject having at least floor(m/p1) images
% testset - each subject k2 images
% Total training set = m images
% Total testing set = k2*p2 images
% cardinality of Train Set + cardinality of Test Set = Total cardinality of
% the database
% Both training and testing sets (subjects) are non-overlapping

% size of the partitions
% p1 = number of classes in the training sets
% p2 = number of classes in the testing sets
size_p = length(unq_a_lab);
% p1 = round((size_p-1)*rand);
p1 = 6;
p2 = size_p-p1;

%split 1
m = 29;
min_reqd = floor(m/p1);
%split 2
k2 = 3;

Train_data_a = cell(m,1);
Test_data_a = cell(p2*k2,1);
tr_a_labels = zeros(1,m);
dummy_labels = tr_a_labels;
tst_a_labels = zeros(1,p2*k2);
x = randperm(length(unq_a_lab));
% filling up the first min_reqd for each class
t1=1;
for j=1:p1
    idx = randperm(c);
    idx = idx(1:min_reqd);
    for k=1:min_reqd
        dummy_labels(t1) = c*(x(j)-1)+idx(k);
        t1 = t1+1;
    end
end
% form the numberset
num_pack = zeros(1,c*p1);
t2=1;
for j=1:p1
    for k=1:c
        num_pack(1,t2) = c*(x(j)-1)+k;
        t2 = t2+1;
    end
end
% getting the indices that have not been already selected previously
% using the set difference operation
% setdiff(A,B) is the values of A that are not in B
new_a_labels = setdiff(num_pack,dummy_labels);
idx = randperm(length(new_a_labels));
% randomly selecting the left amount of values from the set difference
% subset
idx = new_a_labels(idx(1:m-(min_reqd*p1)));
% inserting the values into the matrix
dummy_labels(t1:t1+length(idx)-1) = idx;
% sorting the selected indices
dummy_labels = sort(dummy_labels);

% using the indices of the dummy variables to get the training set and 
% their corresponding labels
for i=1:m
    Train_data_a{i} = a_data{dummy_labels(i)};
    tr_a_labels(1,i) = a_labels(dummy_labels(i));
end

% getting the testing set as previously done in case 2
t2=0;
for i=1:length(unq_a_lab)
    % Random selection of k2 points for the testing set
    id = randperm(c);
    if i>p1
        for j=1:k2
            Test_data_a{t2+j} = a_data{c*(x(i)-1)+id(j)};
            tst_a_labels(1,t2+j) = a_labels(c*(x(i)-1)+id(j));
        end
        t2 = t2+k2;
    end
end
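
The code above keeps the training and testing subjects disjoint, whereas Case 3 as described in the question keeps every subject in both sets. A minimal sketch of the described version might look like the following. It is an illustration on made-up data, not a tested solution: `imin` stands for the question's i, and `pool`, `rest`, `tr_idx` and `tst_idx` are hypothetical names.

```matlab
% Hypothetical toy data: 5 subjects with 4 images each
a_labels = repelem(1:5, 4);
unq  = unique(a_labels);
N    = numel(a_labels);
m    = 13;    % desired training-set size
imin = 2;     % minimum number of training images per subject

% m must cover the per-subject quota and cannot exceed the database
assert(m >= imin*numel(unq) && m <= N);

% Step 1: guarantee imin random images of every subject
tr_idx = [];
for s = 1:numel(unq)
    pool = find(a_labels == unq(s));
    pool = pool(randperm(numel(pool)));
    tr_idx = [tr_idx, pool(1:imin)];
end

% Step 2: fill up to m with random images not chosen yet
rest   = setdiff(1:N, tr_idx);
rest   = rest(randperm(numel(rest)));
tr_idx = sort([tr_idx, rest(1:m - numel(tr_idx))]);

% Step 3: everything else forms the testing set (N - m images)
tst_idx = setdiff(1:N, tr_idx);
```

Note that the random fill in step 2 can place all images of one subject in the training set. If every subject must also appear in the testing set, the fill would need an extra cap per subject.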

Note

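I believe my Cases 1 and 2 are correct. If anything is wrong, please point it out. I need help with Case 3: I have completed it, but I am not at all sure it is right.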

Splitting a dataset into training and testing datasets


0 answers: