Question

I have a file that contains the census information which I would like to query using Pig.

The file format is as follows:

    ID  Name   Year  Gender  State  Count
    1   Jones  1980  M       MA     100

For each year in the file, I would like to get the percentage that each name represents within each state.

How can I loop through each of the years and calculate for each state the percentage of occurrences of each name?

The result should look as follows:

    1901 Jones MA 2%
    1901 Jones VT 3%
    1901 Smith MA 1%
    1901 Lee   VT 4%
    ....
    ....

    2016 Jones MA 2%
    2016 Jones VT 3%
    2016 Smith MA 1%
    2016 Lee   VT 4%

For every year in the table I need to break it down by state and within every state I need to calculate the percentage for each name given the count information.

Answer 1

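You have to generate another relation grouped by year and state, join the dataset with that new relation on year and state, and then compute the percentage.

See below.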

    -- Load the tab-separated census file; each field is declared as name:type.
    A = LOAD 'census_data' USING PigStorage('\t') AS (id:int, name:chararray, year:chararray, gender:chararray, state:chararray, count:int);
    -- Total count per (year, state).
    B = GROUP A BY (year, state);
    C = FOREACH B GENERATE FLATTEN(group) AS (year, state), SUM(A.count) AS occurrences;
    -- Attach each total to its rows, then compute the percentage.
    D = JOIN A BY (year, state), C BY (year, state);
    -- Multiply by 100.0 to avoid integer division; CONCAT needs chararray arguments, hence the cast.
    E = FOREACH D GENERATE A::year, A::name, A::state, CONCAT((chararray)(100.0 * A::count / C::occurrences), '%');
    DUMP E;
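
As a possible variation, the join can be skipped by flattening the grouped bag back into rows next to its own total. This is only a sketch under the same assumptions as the script above (a tab-separated file named 'census_data' with the six columns from the question). The alias `total` and the relation names are illustrative, and the percentage is left unrounded.

    -- Same input schema as in the question; 'census_data' is the assumed path.
    A = LOAD 'census_data' USING PigStorage('\t') AS (id:int, name:chararray, year:chararray, gender:chararray, state:chararray, count:int);
    -- One group per (year, state); each group holds a bag of the original rows.
    B = GROUP A BY (year, state);
    -- FLATTEN(A) emits every row of the bag again, each paired with the group's total.
    C = FOREACH B GENERATE FLATTEN(A), SUM(A.count) AS total;
    -- Flattened fields carry the A:: prefix; 100.0 forces floating-point division.
    D = FOREACH C GENERATE A::year, A::name, A::state, CONCAT((chararray)(100.0 * A::count / total), '%');
    DUMP D;

Because no explicit loop is written in either version, the "loop over each year" is expressed entirely by the GROUP on (year, state): Pig evaluates the FOREACH once per group.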

