How do I filter a table by percentile and then randomly sample it in HQL?

Time: 2018-05-08 15:53:06

Tags: sql hive hql hiveql hue

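I am trying to randomly sample 200 rows from a table, but first I want to filter it so that only the rows in the top 1% of values of one variable are kept.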

I get the following error:

Error while compiling statement: FAILED: ParseException line 3:31 cannot recognize input near 'select' 'percentile_approx' '(' in expression specification

Here is my query:

> with sample_pop as (select * from
> mytable a where
> a.transaction_amount > (select
> percentile_approx(transaction_amount, 0.99) as top1
>                             from mytable) )
> 
> select * from sample_pop  distribute by rand(1) sort by rand(1) limit
> 200;

2 answers:

Answer 0 (score: 0)

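I don't think Hive supports scalar subqueries the way you are using one here (subqueries in the WHERE clause only work with IN / EXISTS). So move the logic into the FROM clause: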

with sample_pop as (
      select *
      from mytable a cross join
           (select percentile_approx(transaction_amount, 0.99) as top1
            from mytable
           ) aa
      where a.transaction_amount > aa.top1
     )
select *
from sample_pop
distribute by rand(1)
sort by rand(1)   -- sort by, as in the question: Hive does not allow order by together with distribute by
limit 200;
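
Before sampling, it can help to check what the cutoff is and how many rows pass it. The following is only a minimal sketch, assuming the same `mytable` and a numeric `transaction_amount` column as in the question; the aliases `cutoff` and `rows_above_cutoff` are illustrative names, not part of the original query:

```sql
-- Sanity check (sketch): approximate 99th-percentile cutoff and the
-- number of rows strictly above it.
-- Assumes table `mytable` with a numeric column `transaction_amount`.
select max(aa.top1) as cutoff,            -- the one-row subquery value, repeated on every joined row
       count(*)     as rows_above_cutoff  -- rows that would feed the 200-row sample
from mytable a
cross join (
    select percentile_approx(transaction_amount, 0.99) as top1
    from mytable
) aa
where a.transaction_amount > aa.top1;
```

If fewer than 200 rows pass the filter, the final `limit 200` just returns all of them. Also note that the cross join is a Cartesian product against a one-row subquery, so a cluster running Hive in strict mode may reject it unless that check is relaxed.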

Answer 1 (score: 0)

The following query solved my problem:

with sample_pop as (select a.* from 
          (
          select *, cum_dist() over (order by transaction_amount asc) pct
          from mytable
          ) a
where pct >= 0.99
)
select *
from sample_pop
distribute by rand(1)
sort by rand(1)   -- sort by, as in the question: Hive does not allow order by together with distribute by
limit 200;