Question

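I have a table in Oracle with, say, 99 rows. I want to take three different (random) samples from it, making sure there is no replacement across the three samples. That is, sample_1 could contain rows 1-33, sample_2 rows 34-66, and sample_3 rows 67-99. I don't want samples whose rows overlap, e.g. sample_1 containing rows 1-33 and sample_2 containing rows 21-53, and so on.

The code I have so far is: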

CREATE TABLE training_sample_1 AS SELECT * FROM pmaster_numeric SAMPLE BLOCK (33, 100)

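I'm using SAMPLE BLOCK because my table actually has about 10 million rows, and a seed value because I want to be able to reference this sample in the future. Similarly, I want to make two more samples of the same size, training_sample_2 and test_sample_1, ensuring that none of the three sample tables contains rows that are also in another sample table (no replacement).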

Answer 1

If the table has an incrementing ID column, you can do this as follows.

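What I suggest is taking the MAX value, breaking it into 3 parts, and then pulling the sample from each of those sets, since the SAMPLE clause works on top of WHERE.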
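A minimal sketch of this idea, assuming a numeric, roughly gap-free incrementing column named id (a hypothetical name; the question does not show the table's columns). It is one possible reading of the suggestion, not the answerer's own code:

```sql
-- Sketch only: "id" is a hypothetical incrementing column on pmaster_numeric.
-- Each statement samples inside its own third of the id range, so no row
-- can end up in more than one of the three tables.
CREATE TABLE training_sample_1 AS
SELECT *
FROM   pmaster_numeric SAMPLE BLOCK (33)
WHERE  id <= (SELECT MAX(id) / 3 FROM pmaster_numeric);

CREATE TABLE training_sample_2 AS
SELECT *
FROM   pmaster_numeric SAMPLE BLOCK (33)
WHERE  id >  (SELECT MAX(id) / 3 FROM pmaster_numeric)
AND    id <= (SELECT 2 * MAX(id) / 3 FROM pmaster_numeric);

CREATE TABLE test_sample_1 AS
SELECT *
FROM   pmaster_numeric SAMPLE BLOCK (33)
WHERE  id >  (SELECT 2 * MAX(id) / 3 FROM pmaster_numeric);
```

Because each statement samples about 33% of the blocks and then keeps only one third of the id range, each table should hold roughly a ninth of the rows; the sample percentage can be raised if larger samples are needed.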

Answer 2

(Edited)

Here is the SQL I ended up using:

-- Create randomly-ordered pmaster_numeric table
CREATE TABLE random_order_pmaster_numeric AS 
SELECT pmast.*, dbms_random.value AS row_number
FROM pmaster_numeric pmast
ORDER BY dbms_random.value;

-- training_sample_1
CREATE TABLE training_sample_1 AS
SELECT *
FROM random_order_pmaster_numeric 
WHERE row_number > 0 and row_number <= .3;

-- training_sample_2
CREATE TABLE training_sample_2 AS
SELECT *
FROM random_order_pmaster_numeric 
WHERE row_number > .3 and row_number <= .6;

-- training_sample_3
CREATE TABLE training_sample_3 AS
SELECT *
FROM random_order_pmaster_numeric 
WHERE row_number > .6 and row_number <= .9;

-- test_sample_1
CREATE TABLE test_sample_1 AS
SELECT *
FROM random_order_pmaster_numeric 
WHERE row_number > .9;

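What I did here was randomly order my entire table, keeping the dbms_random.value (after noticing that its values are uniformly distributed over the 0-1 interval), and then split it into chunks by value range, which I then designated as my samples. This avoids replacement because I select rows by value range rather than with the SAMPLE BLOCK clause, and the WHERE clauses in my queries ensure that no rows overlap between the sample tables.

Here, each training sample is 3 times the size of the test sample.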
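As a quick sanity check on the split, the row counts of the four tables can be compared. This is a small sketch that assumes the four tables were created exactly as above:

```sql
-- Row counts per sample table. With uniformly distributed dbms_random.value,
-- each training table should hold roughly 30% of the rows and the test
-- table roughly 10%.
SELECT 'training_sample_1' AS sample_name, COUNT(*) AS row_count FROM training_sample_1
UNION ALL
SELECT 'training_sample_2', COUNT(*) FROM training_sample_2
UNION ALL
SELECT 'training_sample_3', COUNT(*) FROM training_sample_3
UNION ALL
SELECT 'test_sample_1', COUNT(*) FROM test_sample_1;
```

The counts will not match the target proportions exactly, since each row's value is drawn independently.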

Answer 3

If you want to select everything from one table that is not in another table, you can use MINUS.

CREATE TABLE training_sample_1 AS
SELECT *
FROM pmaster_numeric
SAMPLE BLOCK (33, 100);

CREATE TABLE training_sample_2 AS
select *
from
(
  SELECT *
  FROM pmaster_numeric
  MINUS
  select *
  from   training_sample_1
)
SAMPLE BLOCK (33, 100)

Answer 4

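I think you're missing a trick here, which is to select a single set of data and then use a conditional multi-table insert to split it across multiple target tables.

Modify your select to get the full sample you want, then use a meaningless attribute derived from the data to send each row to a particular target table.

As long as your sample is repeatable and your discriminating attribute is deterministic based on the data (so not rownum, I'd suggest), you'll be in the clear.

Also, this single-pass approach will perform better on large data sets.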

create table test0 (col1 number);

insert into test0
select rownum
from dual
connect by level <= 100;

create table test1 (col1 number);

create table test2 (col1 number);

create table test3 (col1 number);


insert all
  when tgt_table=1 then into test1(col1) values(col1)
  when tgt_table=2 then into test2(col1) values(col1)
  when tgt_table=3 then into test3(col1) values(col1)
select test0.col1,
       mod(rownum,3)+1 tgt_table
from   test0 sample(20);

select * from test1;

select * from test2;

select * from test3;

See the SQL Fiddle for a demo.
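The demo routes rows with mod(rownum,3), while the text recommends an attribute derived from the data. A minimal sketch of that variant, reusing the test tables from the demo, might look like the following. The use of ORA_HASH on col1 as the discriminator and the SEED value of 100 are illustrative assumptions, not part of the original answer:

```sql
-- Assumes test0 is populated and test1..test3 exist (and are empty), as in the demo.
-- ORA_HASH(col1, 2) returns a bucket 0, 1 or 2 that depends only on col1,
-- so a given row is always routed to the same target table.
-- SEED (100) is an arbitrary value chosen here to make the sample repeatable.
insert all
  when tgt_table = 1 then into test1(col1) values(col1)
  when tgt_table = 2 then into test2(col1) values(col1)
  when tgt_table = 3 then into test3(col1) values(col1)
select col1,
       ora_hash(col1, 2) + 1 tgt_table
from   test0 sample (20) seed (100);
```

Hash buckets are only approximately equal in size, so the three target tables will not necessarily receive the same number of rows.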

Sampling without replacement across multiple samples
