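Aggregating data grouped by two columns in Pig

Time: 2015-12-04 23:57:33

Tags: hadoop hive apache-pig

I have some data that I need to group by two columns and then sum two other fields. Say the four columns are named OS, device, view, click. Basically, for each combination of OS and device, I want to know how many views and how many clicks it has.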

(2,3346,1,)
(3,3953,1,1)
(25,4840,1,1)
(2,94840,1,1)
(14,0526,1,1)
(37,4864,1,)
(2,7353,1,)

This is what I have so far:

A is the data: OS, device, view, click

B = GROUP A BY (OS,device);

Result = FOREACH  B {
    GENERATE group AS OS,device, SUM(view) AS visits, SUM(click) AS clicks;};
dump Result; 

This did not work. The error message is: Projected field [OS] does not exist in schema: group:tuple(OS:int,device:long),B:bag{:tuple(OS:int,device:long,view:int,click:int)}
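The schema in that message reflects how GROUP shapes its output: each row has exactly two fields, a tuple named group holding the grouping keys, and a bag named after the relation that was grouped. OS and device are therefore not top-level fields after the GROUP. A minimal sketch for inspecting this, assuming a comma-separated input file and the column types shown in the error (the path and the typed schema are assumptions for illustration):

```pig
-- Assumed input path and types; adjust to the real data.
A = LOAD '/user/root/pig_data' USING PigStorage(',')
    AS (OS:int, device:long, view:int, click:int);

-- Grouping on two keys yields one row per distinct (OS, device) pair.
B = GROUP A BY (OS, device);

-- Prints the schema of B: a tuple field called "group" (containing OS and device)
-- and a bag field called "A" (containing the original rows for that pair).
DESCRIBE B;
```

Running DESCRIBE before writing the FOREACH shows which names are actually available for projection.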

2 Answers:

Answer 0: (score: 1)

Here is tested code; you were missing FLATTEN:

A = LOAD '/user/root/pig_data' using PigStorage(',') AS (OS, device, view, click);
B = GROUP A BY (OS, device);
RESULT = FOREACH B GENERATE FLATTEN(group) AS (OS, device), SUM(A.view) as views, SUM(A.click) as clicks;
dump RESULT;
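The question also asks for a count per OS and device. A possible variant of the script above, offered as a sketch rather than tested code: it declares column types in the LOAD (an assumption based on the types in the error message) and adds COUNT. The alias n_rows is hypothetical.

```pig
-- Typed schema is an assumption; without types the fields load as bytearray.
A = LOAD '/user/root/pig_data' USING PigStorage(',')
    AS (OS:int, device:long, view:int, click:int);

B = GROUP A BY (OS, device);

RESULT = FOREACH B GENERATE
    FLATTEN(group) AS (OS, device),  -- unpack the key tuple into two columns
    COUNT(A)       AS n_rows,        -- rows in the group (skips rows whose first field is null)
    SUM(A.view)    AS views,         -- SUM ignores null values
    SUM(A.click)   AS clicks;        -- empty click fields load as null and are skipped

DUMP RESULT;
```

If device is an identifier rather than a quantity, declaring it as chararray instead of long would preserve values such as 0526 exactly as written.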

Answer 1: (score: 0)

I think B in your example stands in for J2J3, which is probably the name used in your actual code. Try:

B = GROUP A BY (OS, device);

Result = FOREACH B GENERATE
    group.OS AS OS:int,
    group.device AS device:long,
    SUM(B.view) AS visits:int,
    SUM(B.click) AS clicks:int;

dump Result;
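One detail to check when adapting this: the bag inside a grouped relation takes the name of the relation that was grouped, not the name of the GROUP result. With the relation names used in the snippet above (A grouped into B), the bag is called A, so the projections would read A.view and A.click; B.view applies only when the grouped input is itself named B, as the schema in the error message suggests for the asker's real script. A sketch under that reading, with an assumed typed LOAD and with the type annotations on the aliases left out so Pig infers them:

```pig
-- Assumed input path and types, for illustration only.
A = LOAD '/user/root/pig_data' USING PigStorage(',')
    AS (OS:int, device:long, view:int, click:int);

B = GROUP A BY (OS, device);

-- Dot projection into the key tuple is an alternative to FLATTEN(group).
Result = FOREACH B GENERATE
    group.OS      AS OS,
    group.device  AS device,
    SUM(A.view)   AS visits,   -- bag is named A because A was the grouped relation
    SUM(A.click)  AS clicks;

DUMP Result;
```

Note that SUM over int values produces a long, so an explicit visits:int annotation may not match the inferred type.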