Splitting a dataset into two subsets in MATLAB / Octave

Time: 2018-03-12 19:09:35

Tags: matlab octave

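Split the dataset into two subsets, e.g. "train" and "test", where the train set contains 80% of the data and the test set contains the remaining 20%.

Splitting means generating a logical index whose length equals the number of observations in the dataset, with 1 for training samples and 0 for test samples.

N = length(data.x)

Output: logical arrays named idxTrain and idxTest.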

1 answer:

Answer 0: (score: 1)

This should do the trick:

% Generate sample data...
data = rand(32000,1);

% Calculate the number of training entries...
train_off = round(numel(data) * 0.8);

% Split data into training and test vectors...
train = data(1:train_off);
test = data(train_off+1:end);

However, if you really want to rely on logical indexing, you can proceed as follows:

% Generate sample data...
data = rand(32000,1);
data_len = numel(data);

% Calculate the number of training entries...
train_count = round(data_len * 0.8);

% Create the logical indexing...
is_training = [true(train_count,1); false(data_len-train_count,1)];

% Split data into training and test vectors...
train = data(is_training);
test = data(~is_training);

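You can also use the randsample function to add some randomness to the extraction, but this will not give you an exact number of training and test elements each time you run the script: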

% Generate sample data...
data = rand(32000,1);

% Generate a random true/false indexing with unequally weighted probabilities...
is_training = logical(randsample([0 1],numel(data),true,[0.2 0.8]));

% Split data into training and test vectors...
train = data(is_training);
test = data(~is_training);
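
As a side note, randsample ships with the Statistics and Machine Learning Toolbox in MATLAB and with the statistics package in Octave. If it is not available, a rough equivalent can be sketched with plain rand, assuming a per-element 80% probability is what is wanted. The variable train_fraction is introduced here for illustration only, and the split sizes still vary from run to run:

```matlab
% Generate sample data...
data = rand(32000,1);

% Assumed target fraction of training samples (illustrative name)...
train_fraction = 0.8;

% Each element independently lands in the training set with probability 0.8...
is_training = rand(numel(data),1) < train_fraction;

% Split data into training and test vectors...
train = data(is_training);
test = data(~is_training);

% The training count is close to, but usually not exactly, 25600...
fprintf('training: %d, test: %d\n', numel(train), numel(test));
```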

You can avoid this problem by generating the correct number of training and test indices and then shuffling them with randperm-based indexing:

% Generate sample data...
data = rand(32000,1);
data_len = numel(data);

% Calculate the number of training entries...
train_count = round(data_len * 0.8);

% Create the logical indexing...
is_training = [true(train_count,1); false(data_len-train_count,1)];

% Shuffle the logical indexing...
is_training = is_training(randperm(data_len));

% Split data into training and test vectors...
train = data(is_training);
test = data(~is_training);
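
To match the output the question asks for (logical arrays named idxTrain and idxTest, with N = length(data.x)), the shuffled approach could be adapted roughly as follows. This is only a sketch: the contents of data.x are made up here, and nTrain, perm, xTrain and xTest are illustrative names that do not appear in the question.

```matlab
% Hypothetical sample data in the struct layout the question uses...
data.x = rand(1000,1);
N = length(data.x);

% Number of training observations (80% of N, rounded)...
nTrain = round(0.8 * N);

% Mark exactly nTrain randomly chosen positions as training samples...
idxTrain = false(N,1);
perm = randperm(N);
idxTrain(perm(1:nTrain)) = true;

% The test index is the complement of the training index...
idxTest = ~idxTrain;

% Apply the logical indices...
xTrain = data.x(idxTrain);
xTest = data.x(idxTest);
```

Because idxTest is defined as the complement of idxTrain, every observation ends up in exactly one of the two subsets.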