Question

我需要根据175个人口统计选项从500万个观察值的表中抽取一个随机样本。人口统计表的格式如下：

基本上，我需要从5M行表中随机抽样的同一人口统计细分。对于每个人口统计，我都需要一个较大的表中的样本，但观察次数是原来的5倍（例如：对于人口统计1，我希望随机抽样200个样本）。

SELECT  *
FROM    (
        SELECT  *
        FROM    my_table
        ORDER BY
                dbms_random.value
        )
WHERE rownum <= 100;

我以前使用过这种语法来获取随机样本，但是有什么方法可以将其修改为循环并替换现有表中的变量名称？我将尝试用伪代码封装所需的逻辑：

for (each demographic_COLUMN in TABLE1) 
    select random(5*num_obs_COLUMN in TABLE1) from ID_COLUMN in TABLE2
/*somehow join the results of each step in the loop into one giant column of IDs */

Answer 1

您可以联接表（假设两者中都存在1-175人口统计值，或者存在要连接的等效列），如下所示：

select id
from (
  select d.demographic, d.percentage, t.id,
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage

主表中的每一行在其受众人口统计特点中均具有一个随机的伪行号（通过分析row_number()）。然后，外部查询会使用相关百分比来选择要返回的每个受众人口中随机排序的行中有多少行。

我不确定我是否真正了解了您实际上到底要选择多少个，因此可能需要进行调整。

在CTE中具有较小样本并匹配较小匹配条件的演示：

-- CTEs for sample data
with my_table (id, demographic) as (
  select level, mod(level, 175) + 1 from dual connect by level <= 175000
),
demographics (demographic, percentage, str) as (
            select 1, 40, '4%' from dual
  union all select 2, 30, '3%' from dual
  union all select 3, 30, '3%' from dual
  -- ...
  union all select 174, 2, '.02%' from dual
  union all select 175, 1, '.01%' from dual
)
-- actual query
select demographic, percentage, id, rn
from (
  select d.demographic, d.percentage, t.id,
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage;

DEMOGRAPHIC PERCENTAGE         ID         RN
----------- ---------- ---------- ----------
          1         40      94150          1
          1         40      36925          2
          1         40     154000          3
          1         40      82425          4
...
          1         40     154350        199
          1         40     126175        200
          2         30      36051          1
          2         30       1051          2
          2         30     100451          3
          2         30      18026        149
          2         30     151726        150
          3         30     125302          1
          3         30     152252          2
          3         30     114452          3
...
          3         30     104652        149
          3         30      70527        150
        174          2      35698          1
        174          2      67548          2
        174          2     114798          3
...
        174          2      70698          9
        174          2      30973         10
        175          1     139649          1
        175          1     156974          2
        175          1     145774          3
        175          1      97124          4
        175          1      40074          5

（您只需要ID，但我将其他列用于上下文）；或更简洁：

with my_table (id, demographic) as (
  select level, mod(level, 175) + 1 from dual connect by level <= 175000
),
demographics (demographic, percentage, str) as (
            select 1, 40, '4%' from dual
  union all select 2, 30, '3%' from dual
  union all select 3, 30, '3%' from dual
  -- ...
  union all select 174, 2, '.02%' from dual
  union all select 175, 1, '.01%' from dual
)
select demographic, percentage, count(id) as ids, min(id) as min_id, max(id) as max_id
from (
  select d.demographic, d.percentage, t.id,
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage
group by demographic, percentage
order by demographic;

DEMOGRAPHIC PERCENTAGE        IDS     MIN_ID     MAX_ID
----------- ---------- ---------- ---------- ----------
          1         40        200        175     174825
          2         30        150          1     174126
          3         30        150       2452     174477
        174          2         10      23448     146648
        175          1          5      19074     118649

db<>fiddle

从循环输出Oracle SQL创建表

1 个答案: