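I have loaded three tables in Pig, one for each of three subjects, each with the schema name:chararray, score:float. A given name does not necessarily appear in all three subjects.
I need to add up the scores from the three tables into a single table with the name and the total score.
I used to do this in SQL with nested queries. How can I do it in Pig? I tried a full outer join, but got stuck on the null values that show up in the name columns for subjects in which a name is missing.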
Answer 0 (score: 0)
Based on your description of the problem, a simple UNION of the three files followed by a GROUP BY should produce the result you are looking for.
data_1 = LOAD 'union1.csv' USING PigStorage(',') AS (name:chararray,score:float);
data_2 = LOAD 'union2.csv' USING PigStorage(',') AS (name:chararray,score:float);
data_3 = LOAD 'union3.csv' USING PigStorage(',') AS (name:chararray,score:float);
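-- Stack the three relations into one; all share the schema (name, score).
data = UNION data_1,data_2,data_3;
-- Collect every score for a given name into one bag.
data_grp = GROUP data BY name;
-- Sum each bag; a name found in only one file still gets a total.
data_gen = FOREACH data_grp GENERATE group, SUM(data.score);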
dump data_gen;
union1.csv
bob,3
elvis,4
jim,4
dave,2
sneech,4
suess,3
giri,5
union2.csv
mike,2
rick,3
jim,3
giri,4
dave,3
elvis,5
union3.csv
bob,5
bing,4
suess,4
sneech,5
dave,4
jim,2
giri,2
Output
(bob,8.0)
(jim,9.0)
(bing,4.0)
(dave,9.0)
(giri,11.0)
(mike,2.0)
(rick,3.0)
(elvis,9.0)
(suess,7.0)
(sneech,9.0)
Answer 1 (score: 0)
Although you could do this with an outer JOIN, I think it is easier to UNION all the tables together and then GROUP on the Name field:
-- T1, T2, and T3 are the tables you have loaded. Each has the schema
-- TX: (Name: chararray, score: float)
F = UNION T1, T2, T3;
-- If a name appears in only one table, its group simply has a single item in
-- the bag. This means we can use SUM regardless of how many tables the name
-- is in.
G = GROUP F BY Name;
H = FOREACH G GENERATE group AS Name, SUM(F.score) AS totalscore;
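For comparison, the outer-join route mentioned in the question could be sketched roughly as follows. This is only an illustration, assuming T1, T2, and T3 are loaded with the schema (Name: chararray, score: float) as above; the aliases J12, S12, J123, and R are hypothetical names introduced here. After each FULL OUTER join, the name is taken from whichever side is not null, and a missing score is treated as 0.

```pig
-- Sketch only: J12, S12, J123 and R are hypothetical aliases.
-- Join the first two tables, keeping names that appear in either one.
J12 = JOIN T1 BY Name FULL OUTER, T2 BY Name;

-- Take the name from whichever side is present; count a missing score as 0.
S12 = FOREACH J12 GENERATE
        (T1::Name IS NULL ? T2::Name : T1::Name) AS Name,
        ((T1::score IS NULL ? 0.0F : T1::score) +
         (T2::score IS NULL ? 0.0F : T2::score)) AS score;

-- Repeat the same step to fold in the third table.
J123 = JOIN S12 BY Name FULL OUTER, T3 BY Name;
R = FOREACH J123 GENERATE
        (S12::Name IS NULL ? T3::Name : S12::Name) AS Name,
        ((S12::score IS NULL ? 0.0F : S12::score) +
         (T3::score IS NULL ? 0.0F : T3::score)) AS totalscore;
```

Each additional table needs another join and another round of null handling, which is why the UNION followed by GROUP is usually the simpler choice. Note also that if a name occurs more than once within a single table, the join yields several rows for it, whereas UNION plus GROUP sums all of them into one total.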