Given a Hive table:
create table mock
(user string,
url string
);
How can I extract, for each url, a certain percentage of its users (say 50%) or a certain number of users?
Answer 0 (score: 1)
Hive has built-in syntax for sampling from a table:
SELECT * FROM mock TABLESAMPLE(50 PERCENT)
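One caveat: TABLESAMPLE(n PERCENT) is block sampling. Hive picks at least n% of the input by data size, at HDFS block granularity, so the result is not an exact share of the rows and is not balanced per url. If row-level sampling matters, a common idiom is to filter on rand(). The following is only a sketch, with 0.5 as an illustrative threshold:

```sql
-- Row-level sample: each row is kept independently with probability 0.5,
-- so the result is roughly (not exactly) half of the table.
SELECT *
FROM mock
WHERE rand() < 0.5;
```

Because every row is decided independently, the sample size varies between runs, and small url groups can end up well above or below 50%.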
Below is an alternative solution using row_number(). First, number the rows within each user:
with numbered as (
SELECT user, url, row_number() OVER (PARTITION BY user ORDER BY user) as rn FROM mock
)
Then complete the statement by using pmod to select either the odd or the even rows, which gives an approximately 50% sample:
SELECT user, url FROM numbered where pmod(rn,2) = 0
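Since the question asks for a sample per url, the same idea can be partitioned by url instead of user, with a random order inside each group. The query below is a sketch of that variant, not the query from this answer. It rests on some assumptions: the 0.5 fraction and the limit of 10 are placeholder values, `cnt` is a name made up for illustration, each (user, url) pair is assumed to appear only once, and the backticks around `user` are there because newer Hive versions treat USER as a reserved keyword.

```sql
-- Percentage per url: order the rows of each url group randomly,
-- then keep the first 50% of that group.
WITH numbered AS (
  SELECT `user`,
         url,
         row_number() OVER (PARTITION BY url ORDER BY rand()) AS rn,
         count(*)     OVER (PARTITION BY url)                 AS cnt
  FROM mock
)
SELECT `user`, url
FROM numbered
WHERE rn <= ceil(cnt * 0.5);   -- for a fixed number per url, use e.g. rn <= 10 instead
```

Using ceil keeps at least one user for every url. By contrast, pmod(rn, 2) = 0 returns nothing for a group that has a single row, because its only row is numbered 1.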