I have a file containing census information that I would like to query using Pig.
The file format is as follows:
ID Name Year Gender State Count
1 Jones 1980 M MA 100
For every year in the file, I would like to get each name's percentage of the total count for its state in that year.
How can I loop through the years and, for each state, calculate the percentage of occurrences of each name?
The result should look as follows:
1901 Jones MA 2%
1901 Jones VT 3%
1901 Smith MA 1%
1901 Lee VT 4%
....
....
2016 Jones MA 2%
2016 Jones VT 3%
2016 Smith MA 1%
2016 Lee VT 4%
For every year in the table I need to break it down by state and within every state I need to calculate the percentage for each name given the count information.
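Answer 0 (score: 2)
You have to generate a second relation grouped by year and state, join the dataset with that new relation on year and state, and then compute the percentage.
See below.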
-- Assumes tab-separated fields; change the delimiter to match your file.
A = LOAD 'census_data' USING PigStorage('\t') AS (id:int, name:chararray, year:chararray, gender:chararray, state:chararray, count:int);
-- Total count per (year, state).
B = GROUP A BY (year, state);
C = FOREACH B GENERATE FLATTEN(group) AS (year, state), SUM(A.count) AS occurrences;
-- Attach each (year, state) total to the rows it belongs to.
D = JOIN A BY (year, state), C BY (year, state);
-- Multiply by 100.0 to avoid integer division, and cast to chararray because CONCAT expects strings.
E = FOREACH D GENERATE A::year, A::name, A::state, CONCAT((chararray)(100.0 * A::count / C::occurrences), '%');
DUMP E;
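If the same name can appear more than once per state and year (for example, one row for M and one for F), it may help to sum the counts per name first. The following is only a sketch under that assumption. It also avoids the JOIN by flattening the grouped bag next to its total. The relation names N, G, T, R, S and the fields name_count, total and pct are made up for illustration, and the tab delimiter is assumed as before.

```pig
A = LOAD 'census_data' USING PigStorage('\t') AS (id:int, name:chararray, year:chararray, gender:chararray, state:chararray, count:int);

-- Combine rows that share (year, state, name), e.g. the M and F rows for one name.
N = FOREACH (GROUP A BY (year, state, name)) GENERATE FLATTEN(group) AS (year, state, name), SUM(A.count) AS name_count;

-- Group by (year, state); flattening the bag next to SUM repeats the total on every name row.
G = GROUP N BY (year, state);
T = FOREACH G GENERATE FLATTEN(N), SUM(N.name_count) AS total;

-- 100.0 forces floating-point division; the cast is needed because CONCAT expects strings.
R = FOREACH T GENERATE N::year AS year, N::name AS name, N::state AS state, CONCAT((chararray)(100.0 * N::name_count / total), '%') AS pct;

-- Order the output by year, as in the desired result.
S = ORDER R BY year, name, state;
DUMP S;
```

The percentages here are not rounded. If whole numbers are wanted, as in the sample output, wrapping the expression in ROUND before the cast should work.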