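I have some data that I need to group by two columns and then sum two other fields. Suppose the four columns are named OS, device, view, and click. Basically, for each combination of OS and device, I want to know how many views and how many clicks it has.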
(2,3346,1,)
(3,3953,1,1)
(25,4840,1,1)
(2,94840,1,1)
(14,0526,1,1)
(37,4864,1,)
(2,7353,1,)
Here is what I have so far:
A is data: OS,device,view,click
B = GROUP A BY (OS,device);
Result = FOREACH B {
GENERATE group AS OS,device, SUM(view) AS visits, SUM(click) AS clicks;};
dump Result;
This does not work. The error message is: Projected field [OS] does not exist in schema: group:tuple(OS:int,device:long),B:bag{:tuple(OS:int,device:long,view:int,click:int)}
Answer 0 (score: 1)
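Here is tested code. You are missing FLATTEN: after GROUP, the two grouping keys live inside a single tuple named group, and the fields to sum have to be referenced through the bag, as A.view and A.click: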
A = LOAD '/user/root/pig_data' using PigStorage(',') AS (OS, device, view, click);
B = GROUP A BY (OS, device);
RESULT = FOREACH B GENERATE FLATTEN(group) AS (OS, device), SUM(A.view) as views, SUM(A.click) as clicks;
dump RESULT;
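To make the shape of the grouped data easier to picture, here is a rough Python sketch of what the GROUP, FLATTEN and SUM steps do, using the sample rows from the question. This is only an analogy, not Pig itself: the function names are made up for illustration, device is kept as a string so that 0526 keeps its leading zero, and the null handling (skip nulls, return null when every value is null) is my assumption about how Pig's SUM behaves.

```python
from collections import defaultdict

# Sample rows from the question: (OS, device, view, click).
# None marks an empty click field.
rows = [
    (2, "3346", 1, None),
    (3, "3953", 1, 1),
    (25, "4840", 1, 1),
    (2, "94840", 1, 1),
    (14, "0526", 1, 1),
    (37, "4864", 1, None),
    (2, "7353", 1, None),
]


def group_by_os_device(rows):
    """Mimic GROUP A BY (OS, device): key tuple -> bag of full rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row[0], row[1])].append(row)
    return grouped


def sum_ignoring_nulls(values):
    """Sum the non-null values; return None if all are null (assumed Pig-like)."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


def summarize(rows):
    """Mimic FOREACH ... GENERATE FLATTEN(group), SUM(A.view), SUM(A.click)."""
    result = []
    for (os_id, device), bag in group_by_os_device(rows).items():
        views = sum_ignoring_nulls(r[2] for r in bag)
        clicks = sum_ignoring_nulls(r[3] for r in bag)
        # Unpacking the key tuple into two columns is what FLATTEN(group) does.
        result.append((os_id, device, views, clicks))
    return result


if __name__ == "__main__":
    for record in summarize(rows):
        print(record)
```

The point to notice is that each grouped record has only two parts: the key tuple and the bag of original rows. That is why a bare OS or device is not visible in the FOREACH until the key tuple is unpacked.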
Answer 1 (score: 0)
I think the B in your example stands in for J2 or J3, which is probably what your actual code uses. Try:
B = GROUP A BY (OS, device);
Result = FOREACH B GENERATE
group.OS AS OS:int,
group.device AS device:long,
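-- the bag inside B is named A, after the relation that was grouped;
-- SUM over int values yields a long
SUM(A.view) AS visits:long,
SUM(A.click) AS clicks:long;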
dump Result;