How do I implement "grouped" sampling in Hive?

Asked: 2016-11-22 08:30:03

Tags: hive

Given a Hive table:

create table mock
(user string,
 url string
);

How can I sample, for each user, a certain percentage of that user's urls (say 50%) or a fixed number of urls?

1 answer:

Answer 0 (score: 1)

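Hive has a built-in clause for sampling a table. Note that it samples the table as a whole, by data size rather than by row count, and not per user: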

SELECT * FROM mock TABLESAMPLE(50 PERCENT)
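
For reference, here is a rough sketch of the other table-level forms of TABLESAMPLE described in the Hive documentation. The numbers 2 and 100 are arbitrary values chosen for illustration, and none of these variants sample per user either.

```sql
-- Bucket sampling on rand(): rows are assigned at random to 2 buckets
-- and bucket 1 is kept, which gives roughly half of the rows.
SELECT * FROM mock TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) s;

-- Row-count sampling: the limit is applied to each input split,
-- so the total returned can exceed 100 when there are several splits.
SELECT * FROM mock TABLESAMPLE(100 ROWS);
```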

Here is an alternative solution using row_number(). First, number the rows within each user:

with numbered as (
  SELECT user, url, row_number() OVER (PARTITION BY user ORDER BY user) as rn FROM mock
)

Then simply use pmod to keep either the odd or the even rows, which gives a 50% sample for each user:

SELECT user, url FROM numbered where pmod(rn,2) = 0
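
The two fragments above can be combined into a single statement. The following is a minimal sketch rather than the answerer's exact query, and it makes three assumptions: it orders by rand() instead of by user so that the kept rows are chosen at random, it adds a per-user row count (the hypothetical alias cnt) so that percentages other than 50% work, and it quotes `user` in backticks because USER is a reserved word in newer Hive versions (drop the backticks if your version accepts the bare name).

```sql
-- Keep about 50% of each user's rows, picked at random.
WITH numbered AS (
  SELECT
    `user`,
    url,
    -- random position of the row within its user
    row_number() OVER (PARTITION BY `user` ORDER BY rand()) AS rn,
    -- total number of rows for that user
    count(*) OVER (PARTITION BY `user`) AS cnt
  FROM mock
)
SELECT `user`, url
FROM numbered
WHERE rn <= ceil(cnt * 0.5);
```

To take a fixed number of rows per user instead of a percentage, the filter can be replaced with something like `WHERE rn <= 10`, in which case the cnt column is no longer needed. Because of rand(), each run returns a different sample.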