Understanding the GROUP BY operator in Pig

Time: 2014-09-16 22:41:54

Tags: sql group-by apache-pig

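I have a table with the following fields:

ID: chararray, Date: String, Country: String

I load this table into Pig. My goal is to count the number of IDs for each country by month. For example, this is the final result I need: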

Country | Date1 | Date2 | Date3
USA     | 140   | 160   | 200
China   | 120   | 210   | 150

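The numbers are the ID counts for each country on each date.

I am not sure how to use the GROUP BY operator. I first tried GROUP BY (date) and then GROUP BY (date, country), but I am not sure this gives the result I need, because I do not fully understand how GROUP works with a single column versus multiple columns.

Any guidance, ideas, or explanations would be much appreciated.

Thanks!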

1 Answer:

Answer 0: (score: 0)

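If you are familiar with GROUP BY in SQL, the GROUP operator in Pig can be confusing at first, because it does not aggregate by itself. It produces one record per key, with the key in a field named group and all matching records collected in a bag; aggregates such as COUNT are applied afterwards in a FOREACH.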
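To make the single-column versus multi-column difference concrete, here is a minimal sketch. The file name `visits.tsv`, the tab delimiter, and the all-chararray schema are assumptions made for illustration; they are not taken from the question.

```pig
-- Hypothetical input file and schema, for illustration only.
records = LOAD 'visits.tsv' USING PigStorage('\t')
          AS (ID:chararray, Date:chararray, Country:chararray);

-- Single-column GROUP: the 'group' field is a plain value (the Date),
-- and the second field is a bag named 'records' with the matching tuples.
by_date = GROUP records BY Date;
DESCRIBE by_date;

-- Multi-column GROUP: the key list goes in parentheses, and the 'group'
-- field becomes a tuple (Country, Date) instead of a plain value.
by_country_date = GROUP records BY (Country, Date);
DESCRIBE by_country_date;

-- Counting the bag gives one row per (Country, Date) pair.
counts = FOREACH by_country_date
         GENERATE FLATTEN(group) AS (Country, Date),
                  COUNT(records) AS IdCount;
```

Running the two DESCRIBE statements side by side is a quick way to see why FLATTEN(group) is needed only in the multi-column case.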

First, I would GROUP BY both Country and Date:

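-- Note: 'input' is a reserved keyword in Pig, so the alias is named input1 here.
input1 = LOAD '$input' USING AvroStorage(); -- or whatever LOAD storage function fits your data
-- Grouping by several fields requires parentheses around the key list.
input2 = GROUP input1 BY (Country, Date);
input3 = FOREACH input2
            GENERATE
                FLATTEN(group) AS (Country, Date),
                COUNT(input1) AS Count;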

/*
input3 will look like below:

Country    Date       Count
USA        Date1      140
USA        Date2      160
USA        Date3      200
China      Date1      120
China      Date2      210
China      Date3      150

*/

-- Up to this point the script is easy to write and to understand.
-- If the schema above works for you, I would suggest sticking with it.
-- If you really want the tabular format, it would take lots of JOINs on the Country column (31 to be precise, one per day of the month).
-- To give an idea, here is a single JOIN:

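-- A relation cannot be joined with itself under a single alias, so make a copy first.
input3a = FOREACH input3 GENERATE Country, Date, Count;
-- Note: per country, this pairs every date with every date; filter down to the pairs you need.
input4 = JOIN input3 BY Country, input3a BY Country;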

input5 = FOREACH input4 
             GENERATE
                 input3::Country AS Country,
                 input3::Date AS Date1,
                 input3::Count AS Date1Count,
                 input3a::Date AS Date2,
                 input3a::Count AS Date2Count;

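Once you see what it takes to bring together the data for just 2 dates with a JOIN, you will understand why I suggest sticking with the earlier structure (input3).

I hope this helps.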
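For comparison, here is a sketch of a different way to get the wide layout without self-joins: a nested FOREACH that filters the grouped bag once per date column. This is not the method described above, only an alternative under stated assumptions: the input file `visits.tsv`, its schema, and the literal date values 'Date1', 'Date2', 'Date3' are placeholders.

```pig
-- Hypothetical input file and schema, for illustration only.
records = LOAD 'visits.tsv' USING PigStorage('\t')
          AS (ID:chararray, Date:chararray, Country:chararray);

-- Long format: one row per (Country, Date) with its ID count.
grouped = GROUP records BY (Country, Date);
counts  = FOREACH grouped
          GENERATE FLATTEN(group) AS (Country, Date),
                   COUNT(records) AS IdCount;

-- Wide format: regroup by Country, then pick out each date inside the bag.
by_country = GROUP counts BY Country;
pivoted = FOREACH by_country {
    d1 = FILTER counts BY Date == 'Date1';  -- placeholder date values
    d2 = FILTER counts BY Date == 'Date2';
    d3 = FILTER counts BY Date == 'Date3';
    GENERATE group AS Country,
             SUM(d1.IdCount) AS Date1Count,
             SUM(d2.IdCount) AS Date2Count,
             SUM(d3.IdCount) AS Date3Count;
};
```

Each output column still has to be written out by hand, and a country with no rows for a given date gets a null in that column, so the long format remains the simpler choice when the set of dates is large or not known in advance.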