Discarding nulls from a full outer join result in Pig

Time: 2016-03-15 09:37:55

Tags: hadoop join apache-pig

I need help discarding the null values from the result of a full outer join in Pig Latin. Here are the two datasets:

A:

(BOS,2)
(BUR,81)
(LAS,8)

B:

(BUR,56)
(EWR,2)
(LAS,88)

After the full outer join, C:

(BOS,2,,)
(BUR,81,BUR,56)
(,,EWR,2)
(LAS,8,LAS,88)

I need the output in the following format, with one row per key and the two values summed:

(BOS,2)
(BUR,137)
(EWR,2)
(LAS,96)

I have tried different combinations of GROUP BY, FLATTEN, and BagToTuple, but could not work out a solution. Any help is much appreciated.

airline = LOAD '/demo/data/airline/airline.csv' USING PigStorage(',') AS (Origin:chararray, Dest:chararray);
traffic_in = GROUP airline BY Origin;
traffic_in_count = FOREACH traffic_in GENERATE group AS Origin, COUNT(airline) AS count;
traffic_out = GROUP airline BY Dest;
traffic_out_count = FOREACH traffic_out GENERATE group AS Dest, COUNT(airline) AS count;
traffic_top = JOIN traffic_in_count BY Origin FULL OUTER, traffic_out_count BY Dest;

1 answer:

Answer 0 (score: 0)

Edit: Instead of using an OUTER JOIN, use UNION, then GROUP by the first column and SUM the second column's values.

A = LOAD 'test1.txt' using PigStorage(',') as (A1:chararray, A2:int); 
B = LOAD 'test2.txt' using PigStorage(',') as (B1:chararray, B2:int); 
C = UNION A,B;
D = GROUP C BY $0;
E = FOREACH D GENERATE group,SUM(C.$1);
DUMP E;
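
The same idea can be applied directly to the relations from the question. This is a minimal sketch, assuming traffic_in_count and traffic_out_count are defined exactly as in the question's script; the aliases traffic_all, traffic_by_airport, and traffic_total, and the field names airport and total, are made up for illustration. Fields are referenced by position because the two relations name their first column differently (Origin and Dest).

```pig
-- Assumes traffic_in_count and traffic_out_count already exist,
-- built as in the question (each has a key column and a count column).
-- traffic_all, traffic_by_airport, traffic_total are hypothetical aliases.

-- Stack the two relations on top of each other instead of joining them.
traffic_all = UNION traffic_in_count, traffic_out_count;

-- Group by the first column (the airport code).
traffic_by_airport = GROUP traffic_all BY $0;

-- Sum the second column (the count) within each group.
traffic_total = FOREACH traffic_by_airport
                GENERATE group AS airport, SUM(traffic_all.$1) AS total;

DUMP traffic_total;
```

Because UNION never produces the padded null columns that a full outer join does, there is nothing to discard afterwards: a code that appears in only one relation simply contributes a single row to its group.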

Output: one tuple per airport code, holding the code and its summed total.
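
If the FULL OUTER JOIN has to stay, the nulls can also be handled after the join with Pig's bincond operator. The following is only a sketch, not part of the original answer: it reuses the A and B relations and file names from the answer, assumes each key appears at most once per file, and introduces the hypothetical aliases J and K and the field names code and total.

```pig
-- Same inputs as in the answer above.
A = LOAD 'test1.txt' USING PigStorage(',') AS (A1:chararray, A2:int);
B = LOAD 'test2.txt' USING PigStorage(',') AS (B1:chararray, B2:int);

-- Full outer join: rows missing on one side get nulls for that side's fields.
J = JOIN A BY A1 FULL OUTER, B BY B1;

-- Take whichever key is non-null, and treat a missing value as 0 before adding.
K = FOREACH J GENERATE
        (A1 IS NULL ? B1 : A1) AS code,
        (A2 IS NULL ? 0 : A2) + (B2 IS NULL ? 0 : B2) AS total;

DUMP K;
```

The null checks matter because arithmetic on a null in Pig yields null, so adding the two value columns directly would lose the totals for keys that exist on only one side. The UNION approach is usually simpler, and it also copes with keys that repeat within a file, which a join would multiply.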