Apache Pig script with time intervals

Date: 2016-07-18 15:07:48

Tags: hadoop apache-pig

I want to SUM the RW column for each port, per hour.

Time     ID  Name               RW        
-------- --- -------         ----------
14:57:01 000 Port0            1340
14:57:01 001 Port1             13

14:58:01 000 Port0             864
14:58:01 001 Port1             36

14:59:01 000 Port0            1394
14:59:01 001 Port1             22

15:57:01 000 Port0            1340
15:57:01 001 Port1             13

15:58:01 000 Port0            864
15:58:01 001 Port1             36

15:59:01 000 Port0            1394
15:59:01 001 Port1             22
.
.
.

20:57:01 000 Port0            1340
20:57:01 001 Port1             13

20:58:01 000 Port0            864
20:58:01 001 Port1            36

20:59:01 000 Port0            1394
20:59:01 001 Port1             22

My script is:

data = LOAD 'hdfs:/data/data.txt' USING PigStorage(',') AS (time:chararray, id:chararray, name:chararray, read:int, write:int, rw:int);
runs = FOREACH data GENERATE time, name, rw;

How can I do this?

1 answer:

Answer 0 (score: 1)

You need to derive a new column, hour, from the time column, then group by hour and port name, and finally take the sum of rw within each group.

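data = LOAD 'hdfs:/data/data.txt' USING PigStorage(',') AS (time:chararray, id:chararray, name:chararray, read:int, write:int, rw:int);
-- Pig has no timestamp type; parse the HH:mm:ss string into a datetime with ToDate, then take the hour
runs = FOREACH data GENERATE GetHour(ToDate(time, 'HH:mm:ss')) AS hour, name, rw;
grouped = GROUP runs BY (hour, name);
-- After GROUP, the bag is named after the grouped relation (runs), not data
port_total = FOREACH grouped GENERATE FLATTEN(group) AS (hour, name), SUM(runs.rw);
DUMP port_total;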
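
The same hour-and-port grouping can also be sketched without any date parsing. This variant is not from the answer. It assumes every time value is a zero-padded HH:mm:ss string, so the first two characters are the hour. The path and schema are copied from the question; the names rw_total and sorted are made up for illustration.

```pig
-- Sketch only: hourly RW totals per port, taking the hour as plain text.
-- Assumption: time is always zero-padded HH:mm:ss (e.g. '14:57:01').
data = LOAD 'hdfs:/data/data.txt' USING PigStorage(',')
       AS (time:chararray, id:chararray, name:chararray, read:int, write:int, rw:int);

-- SUBSTRING(string, startIndex, stopIndex): characters 0 and 1 hold the hour, e.g. '14'
runs = FOREACH data GENERATE SUBSTRING(time, 0, 2) AS hour, name, rw;

-- One group per (hour, port name) pair
grouped = GROUP runs BY (hour, name);

-- The bag inside each group is named after the grouped relation (runs)
port_total = FOREACH grouped GENERATE FLATTEN(group) AS (hour, name), SUM(runs.rw) AS rw_total;

-- hour is text here, but zero-padding keeps the text order equal to the numeric order
sorted = ORDER port_total BY hour, name;
DUMP sorted;
```

As a sanity check against the rows shown in the question, the three 14:xx readings should add up to 3598 for Port0 (1340 + 864 + 1394) and 71 for Port1 (13 + 36 + 22). If the file contains more rows for that hour than the excerpt shows, the totals will be larger.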