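I have a table with the following fields:
ID: chararray, Date: chararray, Country: chararray
I loaded this table into Pig. My goal is to count the number of IDs per country, by month. For example, this would be my desired final result: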
Country | Date1 | Date2 | Date3
USA     | 140   | 160   | 200
China   | 120   | 210   | 150
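The numbers are the ID counts for each country on each date.
I'm not sure how to use the GROUP BY operator. I first tried GROUP BY (date) and then GROUP BY (date, country), but I'm not sure that gives the result I need, because I don't fully understand how GROUP works with a single column versus multiple columns.
Any guidance, ideas, and explanations are much appreciated.
Thanks!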
Answer 0 (score: 0)
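If you are familiar with the GROUP BY operator in SQL, you should have no trouble understanding GROUP in Pig. The main difference is that Pig's GROUP does not aggregate on its own: it produces one record per key, holding the key in a field named group and the matching rows in a bag, and the aggregation happens in a following FOREACH.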
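To make the single-column versus multi-column difference concrete, here is a minimal sketch. The input path `ids.tsv`, the tab delimiter, and the lowercase field names are assumptions made for illustration, not details from the question.

```pig
-- Hypothetical input path and field names, chosen only for illustration.
records = LOAD 'ids.tsv' USING PigStorage('\t')
          AS (id:chararray, dt:chararray, country:chararray);

-- Single-column GROUP: the key field `group` is a plain chararray (the date),
-- and the second field is a bag named `records` holding the matching rows.
by_date = GROUP records BY dt;
DESCRIBE by_date;

-- Multi-column GROUP: the key field `group` is a tuple (country, dt),
-- so there is one output record per distinct country/date pair.
by_country_date = GROUP records BY (country, dt);
DESCRIBE by_country_date;

-- Aggregation happens in a separate FOREACH over the grouped relation.
date_counts = FOREACH by_date
    GENERATE group AS dt, COUNT(records) AS id_count;

-- FLATTEN turns the tuple key back into two ordinary columns.
country_date_counts = FOREACH by_country_date
    GENERATE FLATTEN(group) AS (country, dt), COUNT(records) AS id_count;
```

Grouping by date alone gives totals across all countries, so it cannot produce per-country numbers; the two-column key is the one that matches the result asked for.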
First, I would GROUP BY Country and Date:
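-- `input` is a reserved word in Pig, so the loaded relation is named input1.
input1 = LOAD '$input' USING AvroStorage(); -- or whatever load function fits your data
-- A multi-column key must be wrapped in parentheses.
input2 = GROUP input1 BY (Country, Date);
input3 = FOREACH input2
    GENERATE
        FLATTEN(group) AS (Country, Date),
        COUNT(input1) AS Count;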
/*
input3 will look like below:
Country Date Count
USA Date1 140
USA Date2 160
USA Date3 200
China Date1 120
China Date2 210
China Date3 150
*/
-- Up to this point the program is easy to write and to understand.
-- If you are fine with the schema above, I would suggest living with it.
-- If you really want the tabular format, it would take a lot of JOINs on the Country column (31, to be precise: the number of days in a month).
-- To give an idea of what a single JOIN looks like:
input3a = FOREACH input3 GENERATE Country, Date, Count;
input4 = JOIN input3 BY Country, input3a BY Country;
input5 = FOREACH input4
    GENERATE
        input3::Country AS Country,
        input3::Date AS Date1,
        input3::Count AS Date1Count,
        input3a::Date AS Date2,
        input3a::Count AS Date2Count;
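Once you see what it takes to bring the data for just two dates together with a JOIN, you can understand why I suggest sticking with the earlier structure (input3).
I hope this helps.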
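If the wide layout is still required, one alternative to chained JOINs is a nested FOREACH that filters each country's bag once per month. This is only a sketch under assumptions: the path `ids.tsv`, the field names, the 'YYYY-MM-DD' date layout, and the two month literals are all hypothetical, and the months to pivot on have to be known in advance.

```pig
-- Hypothetical path, field names, and date layout ('YYYY-MM-DD').
records = LOAD 'ids.tsv' USING PigStorage('\t')
          AS (id:chararray, dt:chararray, country:chararray);

-- Keep only the 'YYYY-MM' prefix so that grouping is per month, not per day.
with_month = FOREACH records
    GENERATE id, country, SUBSTRING(dt, 0, 7) AS ym;

-- Long format: one row per (country, month) pair with its ID count.
grouped = GROUP with_month BY (country, ym);
monthly = FOREACH grouped
    GENERATE
        FLATTEN(group) AS (country, ym),
        COUNT(with_month) AS id_count;

-- Wide format: one row per country, one column per month listed below.
by_country = GROUP monthly BY country;
pivoted = FOREACH by_country {
    -- Each FILTER keeps the rows of this country for one assumed month.
    m1 = FILTER monthly BY ym == '2015-01';
    m2 = FILTER monthly BY ym == '2015-02';
    GENERATE
        group AS country,
        SUM(m1.id_count) AS count_2015_01,
        SUM(m2.id_count) AS count_2015_02;
};

DUMP pivoted;
```

A country with no rows in a given month gets a null in that column, because SUM over an empty bag yields null. Every additional month still needs its own FILTER line and output column, so the long format remains the simpler thing to maintain.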