Pig Query - Selecting an entire bag

Time: 2013-12-04 21:30:31

Tags: apache-pig

If I have a bag like this:

 ({(11983070,39010451,1139539437),(11983070,53425518,11000)})

I want to select the entire tuple that has the MAX last value ($2), but for each bag I can only get the MAX value by itself.

I want the output to be

{(11983070,39010451,1139539437)}

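but I can't get it to work. Any ideas?

2 answers:

Answer 0: (score: 1)

The idea is to first find the MAX, then add the MAX value as an extra column, and then filter out all rows that do not satisfy $2 == $maxValue.

Rough code follows, adapted from this solution: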

records = LOAD 'input.txt' AS (first:int, second:int, third:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group
       GENERATE
           FLATTEN(records.(first, second, third)) AS (first, second, third),
           MAX(records.third) AS max_third;
-- keep only the rows whose third field equals the overall maximum
max_row = FILTER with_max BY third == max_third;
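
For illustration only, the same three steps can be traced in plain Python on the sample tuples from the question. This is not Pig code; the variable names simply mirror the aliases above, and the sample data is assumed to be the two tuples shown in the question.

```python
# Sample tuples taken from the bag in the question.
rows = [(11983070, 39010451, 1139539437), (11983070, 53425518, 11000)]

# Step 1: find the maximum of the third field (the role of MAX(records.third)).
max_third = max(row[2] for row in rows)

# Step 2: attach that maximum to every row as an extra column.
with_max = [row + (max_third,) for row in rows]

# Step 3: keep only the rows whose third field equals the maximum.
# The helper column is dropped again here; the Pig script above keeps it.
max_row = [row[:3] for row in with_max if row[2] == row[3]]

print(max_row)
```

If several tuples share the maximum value, all of them survive the filter.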

答案 1 :(得分:1)

Although you can do this in pure Pig, using a UDF should be more efficient. It is also simple:

myudfs.py

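#!/usr/bin/python

# outputSchema is made available by Pig when the script is registered with jython
@outputSchema('Values:{(first:int, second:int, third:int)}')
def get_max(BAG):
    # Pick the tuple whose third field ($2) is the largest
    v = max(BAG, key=lambda x: x[2])

    # Since you want it to return in a bag, v needs to be in a list
    return [v]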

Pig script

REGISTER 'myudfs.py' USING jython AS myudfs ;

-- A is your input
B = FOREACH A GENERATE myudfs.get_max(my_input_bag) ;
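
To sanity-check the UDF logic without a Pig installation, a rough local sketch could look like the following. The no-op `outputSchema` defined here is a stand-in written only for this test (outside Pig that name does not exist), and `sample_bag` is assumed to be the bag from the question represented as a list of tuples.

```python
# Stand-in for the decorator that Pig supplies to Jython UDFs.
# This no-op version is an assumption made only so the function runs locally.
def outputSchema(schema):
    def wrap(func):
        return func
    return wrap


@outputSchema('Values:{(first:int, second:int, third:int)}')
def get_max(BAG):
    # Pick the tuple whose third field is the largest.
    v = max(BAG, key=lambda x: x[2])
    # Wrap it in a list so the result is shaped like a bag.
    return [v]


# The bag from the question, written as a list of tuples.
sample_bag = [(11983070, 39010451, 1139539437), (11983070, 53425518, 11000)]
result = get_max(sample_bag)
print(result)
```

Note that Python's `max` raises `ValueError` on an empty sequence, so an empty bag would need separate handling. When several tuples tie for the maximum, only the first one is returned.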