Suppose my table has two columns, CUSTTYPE and AMOUNT. I want to add a third column, NTILE, that I can group by and use to compute averages, like this:
CUSTTYPE | AMOUNT | NTILE
----------+---------+----------
RETAIL | 78.00 | 1
RETAIL | 234.00 | 1
RETAIL | 249.00 | 1
RETAIL | 278.00 | 2
RETAIL | 392.00 | 2
RETAIL | 498.00 | 2
RETAIL | 500.00 | 3
RETAIL | 738.00 | 3
RETAIL | 1250.00 | 3
RETAIL | 2029.00 | 4
RETAIL | 2393.00 | 4
RETAIL | 3933.00 | 4
Basically, I am trying to take the average of every n items (here, n = 3):
CUSTTYPE | AMOUNT | NTILE
----------+---------+----------
RETAIL | 187.00 | 1
RETAIL | 389.33 | 2
RETAIL | 829.33 | 3
RETAIL | 2785.00 | 4
From the Pig reference, it looks like this could be done with Over(), but I can't find an example of how to do it. Any ideas?
Answer 0 (score: 2)
You can use the RANK operator to assign a rank to each record in the data:
http://pig.apache.org/docs/r0.14.0/basic.html#rank
Like this:
A = LOAD 'path' AS (schema);
B = RANK A;
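Then turn each rank into a bucket number using integer division by 3. RANK numbers records starting from 1, so add 2 before dividing; that way ranks 1-3 map to bucket 1, ranks 4-6 to bucket 2, and so on:
C = FOREACH B GENERATE ($0 + 2) / 3 AS NTILE, CUSTTYPE, AMOUNT;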
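As a sanity check on the bucketing arithmetic, here is a minimal Python sketch (not Pig). It assumes ranks start at 1, as Pig's RANK produces, and that the rows are already sorted by AMOUNT as in the question. The helper name `bucket` is hypothetical and used only for illustration; a 1-based rank is mapped to a 1-based bucket with `(rank + n - 1) // n`.

```python
from collections import defaultdict

# AMOUNT values from the question, already sorted ascending.
amounts = [78.00, 234.00, 249.00, 278.00, 392.00, 498.00,
           500.00, 738.00, 1250.00, 2029.00, 2393.00, 3933.00]
n = 3  # number of rows per bucket


def bucket(rank, size):
    """Map a 1-based rank to a 1-based bucket number via integer division."""
    return (rank + size - 1) // size


# Collect the amounts that fall into each bucket.
groups = defaultdict(list)
for rank, amount in enumerate(amounts, start=1):
    groups[bucket(rank, n)].append(amount)

# Average each bucket, rounded to two decimals for display.
averages = {ntile: round(sum(vals) / len(vals), 2)
            for ntile, vals in sorted(groups.items())}
print(averages)  # {1: 187.0, 2: 389.33, 3: 829.33, 4: 2785.0}
```

The result matches the averages in the question's expected output. In Pig, the equivalent final step would presumably be a GROUP on the bucket column followed by AVG over AMOUNT.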