I have the following dataset:
dump DATA_INPUT;
(0000001686601081020,10A)
(0000001686601081020,08D)
(0000001686601081020,08D)
(0000001686601081020,08D)
(0000001686601081020,09D)
(0000001686601081020,09D)
(0000001686601081020,08D)
(0000001686601081020,08D)
(0000001686601081020,08D)
(0000001686676950125,0A1)
(0000001686676950125,0A1)
(0000001686676950125,0A2)
Column $0 is account_id and column $1 is cell_id.
For each account_id, I need to find the most frequently used cell_id.
As a first step, I tried:
grpd = group DATA_INPUT by ($0, $1);
cells_count = foreach grpd GENERATE group, COUNT(DATA_INPUT.$1) AS count;
all_cells_counts = GROUP cells_count BY group.$0;
top_cell = FOREACH all_cells_counts {
A = ORDER cells_count BY count DESC;
B = LIMIT A 1;
GENERATE FLATTEN(B.group);
}
The result I get:
((0000001686601081020,08D))
((0000001686676950125,0A1))
How can I get rid of the extra parentheses () so that the result looks like this:
(0000001686601081020,08D)
(0000001686676950125,0A1)
Answer 0 (score: 1)
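FLATTEN(B.group) only unnests the bag B, so each row of top_cell still holds the group key as a single tuple field. Apply FLATTEN to top_cell once more to unnest that tuple: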
final_result = FOREACH top_cell GENERATE FLATTEN($0);
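As an alternative, the extra nesting can probably be avoided altogether by flattening the group key as soon as the counts are computed. The following is only a sketch under assumptions: the aliases data, by_account, sorted, top1 and cnt are names introduced here for illustration, and DATA_INPUT is assumed to have exactly the two fields shown in the question.

```pig
-- Give the two positional fields names (assumed schema: account_id, cell_id).
data = FOREACH DATA_INPUT GENERATE $0 AS account_id, $1 AS cell_id;

-- Count how often each (account_id, cell_id) pair occurs.
grpd = GROUP data BY (account_id, cell_id);
cells_count = FOREACH grpd GENERATE
    FLATTEN(group) AS (account_id, cell_id),  -- unnest the key tuple right away
    COUNT(data) AS cnt;

-- For each account, keep the cell with the highest count.
by_account = GROUP cells_count BY account_id;
top_cell = FOREACH by_account {
    sorted = ORDER cells_count BY cnt DESC;
    top1 = LIMIT sorted 1;
    -- Project two plain fields, so FLATTEN yields (account_id, cell_id) directly.
    GENERATE FLATTEN(top1.(account_id, cell_id));
};

DUMP top_cell;
```

Because the key is flattened in cells_count, the final FLATTEN operates on a bag of two-field tuples and should produce rows in the desired (account_id, cell_id) shape without a second pass. Note that LIMIT 1 keeps only one row per account, so if two cells tie for the highest count, which one is returned is not defined.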