How to get the start and end events from a table

Time: 2015-02-21 16:29:47

Tags: hadoop apache-pig

I have the following records in a table:

session_id sequence timestamp
1           1       298349
1           2       299234
1           3       234255
2           1       153523
2           2       234524
3           1       123434 

I would like to get the following result:

session_id  start       end
1           298349      234255
2           153523      234524
3           123434      123434

How can I do this in Pig?

1 Answer:

Answer 0: (score: 1)

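-- Register the DataFu jar; $piglib is a parameter pointing at the jar directory.
register 'file:$piglib/datafu-1.2.0.jar';

define FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();

-- Load the tab-separated input with an explicit schema.
input_data = load 'so.txt' using PigStorage('\t') as (session_id:int, sequence:int, time:long);

-- One bag of rows per session.
g = group input_data by session_id;

r = foreach g {
    -- Sort each session's bag both ways by sequence.
    s1 = order input_data by sequence asc;
    s2 = order input_data by sequence desc;
    -- The first tuple of each sorted bag carries the start and end time.
    generate group as session_id, FirstTupleFromBag(s1, null).time as start, FirstTupleFromBag(s2, null).time as end;
}

dump r;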

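First, we group by session_id. Then, within each group, we sort the bag by sequence in ascending and in descending order, and take the first tuple from each sorted bag. The time of the first tuple in ascending order is the start, and the time of the first tuple in descending order is the end.

This uses the DataFu UDF library (http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/FirstTupleFromBag.html).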
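To sanity-check the logic outside of Pig, here is a minimal Python sketch of the same group, sort, and take-first/last idea, run on the sample rows from the question. It is only an illustration and not part of the Pig script; the names `rows` and `session_bounds` are hypothetical. Note that start and end are picked by sequence, not by timestamp, which is why session 1 ends at 234255 even though that value is smaller than its start.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (session_id, sequence, timestamp)
rows = [
    (1, 1, 298349),
    (1, 2, 299234),
    (1, 3, 234255),
    (2, 1, 153523),
    (2, 2, 234524),
    (3, 1, 123434),
]


def session_bounds(rows):
    """Return one (session_id, start, end) tuple per session.

    start/end are the timestamps of the rows with the lowest and
    highest sequence number in each session.
    """
    # Sorting by (session_id, sequence) stands in for GROUP plus ORDER.
    ordered = sorted(rows, key=itemgetter(0, 1))
    result = []
    for session_id, group in groupby(ordered, key=itemgetter(0)):
        events = list(group)
        # The first row of the ascending sort is the start; the last row
        # (the first row of a descending sort) is the end.
        result.append((session_id, events[0][2], events[-1][2]))
    return result


bounds = session_bounds(rows)
for row in bounds:
    print(*row)
```

A session with a single event, such as session 3, gets the same timestamp for both start and end, matching the expected result in the question.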