我需要根据175个人口统计选项从500万个观察值的表中抽取一个随机样本。人口统计表的格式如下:
1 40 4%
2 30 3%
3 30 3%
- -
174 2 .02%
175 1 .01%
基本上,我需要从5M行表中随机抽样的同一人口统计细分。对于每个人口统计,我都需要一个较大的表中的样本,但观察次数是原来的5倍(例如:对于人口统计1,我希望随机抽样200个样本)。
SELECT *
FROM (
SELECT *
FROM my_table
ORDER BY
dbms_random.value
)
WHERE rownum <= 100;
我以前使用过这种语法来获取随机样本,但是有什么方法可以将其修改为循环并替换现有表中的变量名称?我将尝试用伪代码封装所需的逻辑:
for (each demographic_COLUMN in TABLE1)
select random(5*num_obs_COLUMN in TABLE1) from ID_COLUMN in TABLE2
/*somehow join the results of each step in the loop into one giant column of IDs */
答案 0 :(得分:1)
您可以联接表(假设两者中都存在1-175人口统计值,或者存在要连接的等效列),如下所示:
select id
from (
select d.demographic, d.percentage, t.id,
row_number() over (partition by d.demographic order by dbms_random.value) as rn
from demographics d
join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage
主表中的每一行在其受众人口统计特点中均具有一个随机的伪行号(通过分析row_number()
)。然后,外部查询会使用相关百分比来选择要返回的每个受众人口中随机排序的行中有多少行。
我不确定我是否真正了解了您实际上到底要选择多少个,因此可能需要进行调整。
在CTE中具有较小样本并匹配较小匹配条件的演示:
-- CTEs for sample data
with my_table (id, demographic) as (
select level, mod(level, 175) + 1 from dual connect by level <= 175000
),
demographics (demographic, percentage, str) as (
select 1, 40, '4%' from dual
union all select 2, 30, '3%' from dual
union all select 3, 30, '3%' from dual
-- ...
union all select 174, 2, '.02%' from dual
union all select 175, 1, '.01%' from dual
)
-- actual query
select demographic, percentage, id, rn
from (
select d.demographic, d.percentage, t.id,
row_number() over (partition by d.demographic order by dbms_random.value) as rn
from demographics d
join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage;
DEMOGRAPHIC PERCENTAGE ID RN
----------- ---------- ---------- ----------
1 40 94150 1
1 40 36925 2
1 40 154000 3
1 40 82425 4
...
1 40 154350 199
1 40 126175 200
2 30 36051 1
2 30 1051 2
2 30 100451 3
2 30 18026 149
2 30 151726 150
3 30 125302 1
3 30 152252 2
3 30 114452 3
...
3 30 104652 149
3 30 70527 150
174 2 35698 1
174 2 67548 2
174 2 114798 3
...
174 2 70698 9
174 2 30973 10
175 1 139649 1
175 1 156974 2
175 1 145774 3
175 1 97124 4
175 1 40074 5
(您只需要ID,但我将其他列用于上下文);或更简洁:
with my_table (id, demographic) as (
select level, mod(level, 175) + 1 from dual connect by level <= 175000
),
demographics (demographic, percentage, str) as (
select 1, 40, '4%' from dual
union all select 2, 30, '3%' from dual
union all select 3, 30, '3%' from dual
-- ...
union all select 174, 2, '.02%' from dual
union all select 175, 1, '.01%' from dual
)
select demographic, percentage, count(id) as ids, min(id) as min_id, max(id) as max_id
from (
select d.demographic, d.percentage, t.id,
row_number() over (partition by d.demographic order by dbms_random.value) as rn
from demographics d
join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage
group by demographic, percentage
order by demographic;
DEMOGRAPHIC PERCENTAGE IDS MIN_ID MAX_ID
----------- ---------- ---------- ---------- ----------
1 40 200 175 174825
2 30 150 1 174126
3 30 150 2452 174477
174 2 10 23448 146648
175 1 5 19074 118649