Finding the most frequent value with Pig

Time: 2014-10-31 14:42:40

Tags: apache-pig

I have the following dataset:

dump DATA_INPUT;
     (0000001686601081020,10A)
     (0000001686601081020,08D)
     (0000001686601081020,08D)
     (0000001686601081020,08D)
     (0000001686601081020,09D)
     (0000001686601081020,09D)
     (0000001686601081020,08D)
     (0000001686601081020,08D)
     (0000001686601081020,08D)
     (0000001686676950125,0A1)
     (0000001686676950125,0A1)
     (0000001686676950125,0A2)

Column $0 is the account_id and column $1 is the cell_id.

For each account_id, I need to find the most frequently used cell_id.

My first attempt was:

    grpd = GROUP DATA_INPUT BY ($0, $1);
    cells_count = FOREACH grpd GENERATE group, COUNT(DATA_INPUT.$1) AS count;
    all_cells_counts = GROUP cells_count BY group.$0;
    top_cell = FOREACH all_cells_counts {
        A = ORDER cells_count BY count DESC;
        B = LIMIT A 1;
        GENERATE FLATTEN(B.group);
    }

The result I get:

     ((0000001686601081020,08D))
     ((0000001686676950125,0A1))

How can I get rid of the extra parentheses so that the result looks like this?

     (0000001686601081020,08D)
     (0000001686676950125,0A1)
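
One way to see where the extra level of parentheses comes from is to inspect the schemas with DESCRIBE. This is only a diagnostic sketch that reuses the relation names from the script above. The exact output depends on whether DATA_INPUT was loaded with a schema, so none is shown here.

```pig
-- Diagnostic sketch: print the schema of the intermediate and final relations.
-- In cells_count, the field "group" is itself a tuple (account_id, cell_id).
DESCRIBE cells_count;
-- In top_cell, FLATTEN(B.group) removes the bag level only,
-- so each row still holds a single field containing that tuple.
DESCRIBE top_cell;
```

In other words, B.group is a bag of one-field tuples, and that one field is the (account_id, cell_id) tuple. FLATTEN applied to the bag un-nests the bag but leaves the inner tuple in place, which is why each row is printed with double parentheses.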

1 Answer:

Answer 0: (score: 1)

Apply FLATTEN to top_cell once more, which unwraps the remaining tuple into two separate fields:

final_result = FOREACH top_cell GENERATE FLATTEN($0);
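
As an alternative to the extra FOREACH, the grouping key could be flattened right after the first GROUP, so that no nested tuple is carried through the rest of the script. The following is a minimal sketch of that idea, not the answerer's method. The field names account_id, cell_id and cnt and the relation name by_account are illustrative assumptions that do not appear in the original post.

```pig
-- Count occurrences of each (account_id, cell_id) pair.
grpd = GROUP DATA_INPUT BY ($0, $1);

-- Flatten the group key immediately and name its parts (assumed names).
cells_count = FOREACH grpd GENERATE
    FLATTEN(group) AS (account_id, cell_id),
    COUNT(DATA_INPUT.$1) AS cnt;

-- Collect all per-cell counts for each account.
by_account = GROUP cells_count BY account_id;

-- Keep the cell with the highest count per account.
top_cell = FOREACH by_account {
    A = ORDER cells_count BY cnt DESC;
    B = LIMIT A 1;
    -- Project only the two wanted fields, then flatten the one-row bag.
    GENERATE FLATTEN(B.(account_id, cell_id));
}

DUMP top_cell;
```

With this layout each output row should contain two plain fields rather than one nested tuple. Note that when two cells are tied for the highest count within an account, LIMIT 1 keeps only one of them, and which one is not guaranteed.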