Question

I have a consumer table like this.

consumer | product | quantity
-------- | ------- | --------
a        | x       | 3
a        | y       | 4
a        | z       | 1
b        | x       | 3
b        | y       | 5
c        | x       | 4

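What I want is a "normalized" rank assigned to each consumer, so that I can easily split the table into training and test sets. I used dense_rank() in Hive, which gave me the following table.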

rank | consumer | product | quantity
---- | -------- | ------- | --------
1    | a        | x       | 3
1    | a        | y       | 4
1    | a        | z       | 1
2    | b        | x       | 3
2    | b        | y       | 5
3    | c        | x       | 4

This is fine, but I want it to scale to an arbitrary number of consumers, so ideally the rank would fall in the range 0 to 1, like this.

rank | consumer | product | quantity
---- | -------- | ------- | --------
0.33 | a        | x       | 3
0.33 | a        | y       | 4
0.33 | a        | z       | 1
0.67 | b        | x       | 3
0.67 | b        | y       | 5
1    | c        | x       | 4

That way I always know the range of the rank and can split the data in a standard way (rank <= 0.7 for training, rank > 0.7 for testing).

Is there a way to achieve this in Hive?

Alternatively, is there a different, better approach to my original problem of splitting the data?

I tried select * where rank < 0.7*max(rank), but Hive says the MAX UDAF is not yet supported in the where clause.

Answer 1

PERCENT_RANK

select  percent_rank() over (order by consumer) as pr
       ,* 

from    mytable
;

+-----+----------+---------+----------+
| pr  | consumer | product | quantity |
+-----+----------+---------+----------+
| 0.0 | a        | z       |        1 |
| 0.0 | a        | y       |        4 |
| 0.0 | a        | x       |        3 |
| 0.6 | b        | y       |        5 |
| 0.6 | b        | x       |        3 |
| 1.0 | c        | x       |        4 |
+-----+----------+---------+----------+
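Note that percent_rank is computed over rows, as (rank - 1) / (total rows - 1), so its values depend on how many rows each consumer has. That is why b gets 0.6 here rather than the 0.67 shown in the question. If the exact per-consumer scale is needed, one possible sketch is to divide dense_rank by its maximum, with the maximum taken by a second window function in an outer query. This has not been run against a Hive cluster; the aliases dr and norm_rank are hypothetical names, and it assumes your Hive version accepts an empty over () clause inside an arithmetic expression.

```sql
-- Inner query: ordinary dense_rank per consumer (1, 2, 3, ...).
-- Outer query: divide by the largest dense_rank in the whole table,
-- so the highest-ranked consumer always gets 1.
select  dr / max(dr) over () as norm_rank   -- hypothetical alias
       ,consumer
       ,product
       ,quantity

from   (select  dense_rank() over (order by consumer) as dr
               ,consumer
               ,product
               ,quantity

        from    mytable
        ) t
;
```

With the sample data this should give 1/3 for a, 2/3 for b and 1 for c. The aggregate is applied as a window function in the select list rather than in the where clause, which is what avoids the UDAF error mentioned in the question.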

For filtering, you need a subquery / CTE

select  *

from   (select  percent_rank() over (order by consumer) as pr
               ,* 

        from    mytable
        ) t

where   pr <= ...
;
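For the train/test split described in the question, the same pattern can label each row instead of filtering. The following is only a sketch: the CTE name ranked and the column alias subset are hypothetical, and the 0.7 cut-off is the one quoted in the question.

```sql
-- Compute percent_rank once in a CTE, then label each row.
with ranked as (
    select  percent_rank() over (order by consumer) as pr
           ,consumer
           ,product
           ,quantity

    from    mytable
)

select  case when pr <= 0.7 then 'train' else 'test' end as subset
       ,consumer
       ,product
       ,quantity

from    ranked
;
```

With the sample data, consumers a (0.0) and b (0.6) would land in train and c (1.0) in test. All rows of one consumer tie in the order by and therefore share the same pr, so a consumer is never split across both sets.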
