Comparing tuple values in a bag with a hard-coded string value

Time: 2019-02-09 10:23:16

Tags: apache-pig

I have a dataset with these columns:

FMID,County,WIC,WICcash

Here is a sample of the data:

1002267,Douglas,Y,N
21005876,Douglas,Y,N
1001666,Douglas,N,Y

I have grouped the data by county and filtered it on County = 'Douglas'. The output is:

(Douglas,{(1002267,Douglas,Y,N),(21005876,Douglas,Y,N),(1001666,Douglas,N,Y)})

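Now I want a combined count, across the WIC and WICcash columns, of the values that are Y.

Here, taking the WIC and WICcash columns together, there are 3 Y values, so my output would be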

Douglas 3

How can I achieve this?

Below is the code I have written so far

load_data = LOAD 'PigPrograms/Markets/DATA_GOV_US_Farmers_Market_DataSet.csv' USING PigStorage(',') as (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);

group_markets_by_county = GROUP load_data BY County;

filter_county = FILTER group_markets_by_county BY group == 'Douglas';

DUMP filter_county;

1 Answer:

Answer 0: (score: 0)

To look inside the bag, you can use a nested FOREACH.

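-- Without a USING clause, LOAD splits fields on tabs; for a comma-delimited file like the one in the question, add USING PigStorage(',').
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);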
B = GROUP A by County;
describe B; /* B: {group: chararray,A: {(FMID: long,County: chararray,WIC: chararray,WICcash: chararray)}} */ 
C = FOREACH B {
        FILTER_WIC_Y = FILTER A by WIC == 'Y';
        COUNT_WIC_Y = COUNT(FILTER_WIC_Y);
        FILTER_WICcash_Y = FILTER A by WICcash == 'Y';
        COUNT_WICcash_Y = COUNT(FILTER_WICcash_Y);
        GENERATE group, COUNT_WIC_Y + COUNT_WICcash_Y as count;
}
dump C;

Alternatively, you can replace 'Y' and 'N' with 1 and 0 and then add them up.

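-- Without a USING clause, LOAD splits fields on tabs; for a comma-delimited file like the one in the question, add USING PigStorage(',').
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);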
B = FOREACH A GENERATE FMID, County, (WIC == 'Y' ? 1 : 0 ) as wic, (WICcash == 'Y' ? 1 : 0 ) as wiccash;
C = GROUP B by County;
D = FOREACH C GENERATE group, SUM(B.wic) + SUM(B.wiccash) as count;
dump D;
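
Neither snippet above restricts the result to Douglas, which the question asked for. As a rough sketch only, the 1/0 approach could be combined with the question's own LOAD statement and a FILTER applied before grouping. The aliases (markets, douglas, flags, by_county, result) and the field names y_count and total_y are made up for illustration, and the file is assumed to be comma-delimited as in the question.

```pig
-- Load the comma-delimited file using the path and schema from the question.
markets = LOAD 'PigPrograms/Markets/DATA_GOV_US_Farmers_Market_DataSet.csv'
          USING PigStorage(',')
          AS (FMID:long, County:chararray, WIC:chararray, WICcash:chararray);

-- Keep only the county of interest before grouping, so less data is grouped.
douglas = FILTER markets BY County == 'Douglas';

-- Per row, count how many of the two columns equal 'Y' (0, 1 or 2).
flags = FOREACH douglas GENERATE County,
        ((WIC == 'Y' ? 1 : 0) + (WICcash == 'Y' ? 1 : 0)) AS y_count;

-- Sum the per-row counts for each county.
by_county = GROUP flags BY County;
result = FOREACH by_county GENERATE group AS County, SUM(flags.y_count) AS total_y;

DUMP result;
```

For the three sample rows in the question this should print (Douglas,3). A row with Y in both columns contributes 2, which matches the combined count the question describes.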