My data rows look like this:
user startTimestamp endTimestamp durationForeground application
11 1409239321000 1409239395000 1 IEXPLORE.EXE
11 1409239322000 0 19 IEXPLORE.EXE
11 1409239341000 0 18 IEXPLORE.EXE
11 1409239359000 0 0 IEXPLORE.EXE
11 1409239359000 0 7 IEXPLORE.EXE
11 1409239366000 0 6 IEXPLORE.EXE
11 1409239372000 0 10 IEXPLORE.EXE
11 1409239382000 0 13 IEXPLORE.EXE
11 1409239395000 1409239446000 9 MSPAINT.EXE
11 1409239404000 0 4 MSPAINT.EXE
11 1409239408000 0 13 MSPAINT.EXE
11 1409239421000 0 12 MSPAINT.EXE
11 1409239433000 0 5 MSPAINT.EXE
11 1409239438000 0 8 MSPAINT.EXE
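I would like to add up all the durationForeground values for each group, where a group starts with a row that has an endTimestamp and ends with the last row before the next such row.
(The reason is that the difference between endTimestamp and startTimestamp gives us the running time of the app, while the sum of durationForeground gives us the time the application spent in the foreground.)
Can this be done with Pig?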
Answer 0 (score: 2)
You probably want to group the data by user and application and take the sum of durationForeground.
Example
Input:
11 1409239321000 1409239395000 1 IEXPLORE.EXE
11 1409239322000 0 19 IEXPLORE.EXE
11 1409239341000 0 18 IEXPLORE.EXE
11 1409239359000 0 0 IEXPLORE.EXE
11 1409239359000 0 7 IEXPLORE.EXE
11 1409239366000 0 6 IEXPLORE.EXE
11 1409239372000 0 10 IEXPLORE.EXE
11 1409239382000 0 13 IEXPLORE.EXE
11 1409239395000 1409239446000 9 MSPAINT.EXE
11 1409239404000 0 4 MSPAINT.EXE
11 1409239408000 0 13 MSPAINT.EXE
11 1409239421000 0 12 MSPAINT.EXE
11 1409239433000 0 5 MSPAINT.EXE
11 1409239438000 0 8 MSPAINT.EXE
Pig script:
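-- Load the tab-separated rows with an explicit schema.
A = LOAD 'input' USING PigStorage() AS (user:int, startTimestamp:long, endTimestamp:long, durationForeground:long, application:chararray);
-- Collect all rows that share the same user and application.
B = GROUP A BY (user, application);
-- Emit one row per group with the total foreground time.
C = FOREACH B GENERATE FLATTEN(group), SUM(A.durationForeground);
DUMP C;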
Output:
(11,MSPAINT.EXE,51)
(11,IEXPLORE.EXE,74)
The approach above assumes that all input fields are separated by tabs (\t), which is the default delimiter for PigStorage().
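Note that grouping by (user, application) merges every run of the same application for a user into one total. If the same application can be started more than once, the following is a minimal sketch that keeps each run separate. It rests on assumptions that are not confirmed by the question: a row with endTimestamp != 0 opens a run, every other row of that run has a startTimestamp inside [startTimestamp, endTimestamp) of the opening row, and durationForeground is in seconds (the sample timestamps suggest this). The aliases Sessions, sessionStart, sessionEnd, runtimeSeconds and foregroundSeconds are illustrative names only.

```pig
-- Sketch only: relies on the assumptions stated above.
A = LOAD 'input' USING PigStorage() AS (user:int, startTimestamp:long, endTimestamp:long, durationForeground:long, application:chararray);

-- Rows with a non-zero endTimestamp open a run and carry its end time.
Starts = FILTER A BY endTimestamp != 0L;
Sessions = FOREACH Starts GENERATE user, application, startTimestamp AS sessionStart, endTimestamp AS sessionEnd;

-- Pig joins are equi-joins, so join on (user, application) first
-- and apply the time-range condition in a separate FILTER.
J = JOIN A BY (user, application), Sessions BY (user, application);
F = FILTER J BY A::startTimestamp >= Sessions::sessionStart AND A::startTimestamp < Sessions::sessionEnd;

-- Keep only the fields needed, with unambiguous names.
P = FOREACH F GENERATE A::user AS user, A::application AS application, Sessions::sessionStart AS sessionStart, Sessions::sessionEnd AS sessionEnd, A::durationForeground AS durationForeground;

-- One group per run: total foreground time next to the run time.
G = GROUP P BY (user, application, sessionStart, sessionEnd);
R = FOREACH G GENERATE FLATTEN(group), (group.sessionEnd - group.sessionStart) / 1000L AS runtimeSeconds, SUM(P.durationForeground) AS foregroundSeconds;
DUMP R;
```

For the sample input each application has only one run, so the foreground sums stay at 74 and 51. The join on (user, application) can be expensive when a user has many runs of the same application, because every row is paired with every run before the filter is applied.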