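Sampling without replacement in SQL with a different sample size per group

Time: 2018-02-07 14:02:39

Tags: sql vertica sampling

Using the table provided, I would like to randomly sample users for each day. The number of users to sample is specified in the to_sample column, which is populated by another query. In this example I want to sample 1 observation for day one and 2 observations for day two (but this changes with every execution of the query, so don't hard-code these numbers). I want the users assigned to different days to be different (no overlapping assignments).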

drop table if exists test; 

create table test (
user_id int,
day_of_week int,
to_sample int);

insert into test values (1, 1, 1);
insert into test values (1, 2, 2);
insert into test values (2, 1, 1);
insert into test values (2, 2, 2);
insert into test values (3, 1, 1);
insert into test values (3, 2, 2);
insert into test values (4, 1, 1);
insert into test values (4, 2, 2);
insert into test values (5, 1, 1);
insert into test values (5, 2, 2);
insert into test values (6, 1, 1);
insert into test values (6, 2, 2);

The expected result would look like this:

create table results (
user_id int,
day_of_week int);

insert into results values (1, 1);
insert into results values (3, 2);
insert into results values (6, 2);

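As I said, the number of users to sample will vary each time and should be taken from the to_sample column of the test table. Also, I will run this for 7 days; only 2 are used here to keep the example simple.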
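The requirement above (a per-day sample size taken from to_sample, and no user assigned to more than one day) can be illustrated outside SQL. What follows is a minimal Python sketch, not a Vertica query: it shuffles the distinct users once and hands each day the next unused slice of that shuffled list. The function name `sample_disjoint`, its parameters, and the dictionary layout are assumptions made for illustration and do not appear in the post.

```python
import random


def sample_disjoint(user_ids, to_sample, rng=None):
    """Assign users to days at random so that no user appears on two days.

    user_ids  -- iterable of distinct user ids
    to_sample -- dict mapping day_of_week -> number of users wanted that day
    rng       -- optional random.Random instance (hypothetical hook for reproducibility)
    """
    rng = rng or random.Random()
    users = list(user_ids)
    if sum(to_sample.values()) > len(users):
        raise ValueError("not enough users to sample without overlap")
    rng.shuffle(users)  # one random order shared by all days

    result = []
    start = 0
    for day in sorted(to_sample):
        size = to_sample[day]
        # Each day takes the next unused slice of the shuffled list.
        for user_id in users[start:start + size]:
            result.append((user_id, day))
        start += size
    return result


# Same shape as the example data: six users, 1 sample on day 1, 2 on day 2.
rows = sample_disjoint(range(1, 7), {1: 1, 2: 2})
print(rows)
```

Because every day draws from a single shared random order, the slices cannot overlap by construction. In SQL the same idea would roughly correspond to numbering the users once in random order and matching each day to a range of those numbers; that translation is an assumption and is not shown in the post.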

Edit

with day_1 as(
select t.user_id, t.day_of_week
from (select t.*, row_number() over (partition by day_of_week order by randomint(100)) as seqnum
      from test t where t.day_of_week = 1 
     ) t 
where t.seqnum <= (select distinct to_sample from test where day_of_week = 1)
)
, day_2 as(
select t.user_id, t.day_of_week
from (select t.*, row_number() over (partition by day_of_week order by randomint(100)) as seqnum
      from test t where t.user_id not in (select distinct user_id from day_1) and t.day_of_week = 2 
     ) t 
where t.seqnum <= (select distinct to_sample from test where day_of_week = 2) 
)
select * from day_1 union all select * from day_2

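I tried to put together a brute-force solution based on some of the answers, but I still get repeated users, even though I exclude from day_2 the user_ids already used in day_1: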

user_id | day_of_week
---------+-------------
       4 |           1
       4 |           2
       1 |           2

2 answers:

Answer 0 (score: 1)

If I understood you correctly, try the following (it is actually an improved version of @BHouse's solution):

SELECT
    T.user_id,
    T.day_of_week
FROM (
    SELECT
        user_id,
        day_of_week,
        to_sample,
        row_number() OVER (PARTITION BY to_sample ORDER BY randomint(max(user_id) + 1)) AS RN
    FROM
        test
    GROUP BY
        user_id,
        day_of_week,
        to_sample
    ORDER BY
        to_sample
    ) AS T
WHERE
    T.RN <= T.to_sample;

Sample output for the data provided:

First execution

 user_id | day_of_week
---------+-------------
       1 |           1
       3 |           2
       2 |           2

Second execution

 user_id | day_of_week
---------+-------------
       1 |           1
       1 |           2
       4 |           2

Third execution

 user_id | day_of_week
---------+-------------
       5 |           1
       4 |           2
       2 |           2

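So some randomness is guaranteed. Note, though, that in the second execution user 1 is returned for both days, so this query alone does not prevent the same user from being assigned to different days.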

Update

Or try this:

SELECT
    T.user_id,
    T.day_of_week
FROM (
    SELECT
        user_id,
        day_of_week,
        to_sample,
        row_number() OVER (PARTITION BY to_sample) AS RN,
        randomint(42) AS RANDOM_ORDER /* <<-- here is main problem, number should be >= max(user_id) + 1 */
    FROM
        test
    ORDER BY
        to_sample,
        RANDOM_ORDER
    ) AS T
WHERE
    T.RN <= T.to_sample;

The second option is faster, but I have not tested it on significant cases.

Answer 1 (score: 0)

Use the following select and you will get the required sample output:

select USER_ID, day_of_week
from (select user_id, day_of_week, ROW_NUMBER() over (order by user_id) rn
      from #test
      where day_of_week = 1
     ) x
where rn = 1
union all
select USER_ID, day_of_week
from (select user_id, day_of_week, ROW_NUMBER() over (order by user_id) rn
      from #test
      where day_of_week = 2
     ) x
where rn in (3, 6)
