Filtering groups by ID in Pig

Time: 2015-11-12 03:01:01

Tags: hadoop apache-pig

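I am trying to filter a set of descriptions out of one schema based on the IDs in the first schema.

I am new to Pig, so I am finding this hard to grasp.

Here is the code I have put together, which does not work: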

changeReason = LOAD 'Change_Reason.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('|', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (changeReasonID:int, reasonName:chararray);
price = LOAD '$directory/Price.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('|', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (priceID:int, changeReasonID:int);

priceChangeReasonIDs = GROUP price BY changeReasonID;
subGroup = FOREACH priceChangeReasonIDs
{
    change = FILTER changeReason BY changeReasonID == group.changeReasonId;
    GENERATE group AS changeID, change.reasonName AS Reason;
};

The code gives the following error:

Failed to parse: Pig script failed to parse: 
<file load_historical_price.pig, line 108, column 20> expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)

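1 answer:

Answer 0: (score: 0)

This working example may help.

If I understand you correctly, you want to filter grouped data on the group element.

So here is my sample script: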

data = LOAD 'SO/data.txt' USING PigStorage(' ') AS (val:int, id1:chararray, id2:int);
DESCRIBE data;
dgroup = GROUP data BY (id1, id2);
DESCRIBE dgroup;
dfilter = FILTER dgroup BY group.id1 == 'B';
DESCRIBE dfilter;
DUMP dfilter;

This groups the data by (id1, id2) and then keeps only the groups whose id1 is 'B'.

Sample input:

12 A 1
22 A 2
32 B 1
33 B 2
43 B 1
55 A 2
77 B 2
88 A 1 

Result of the DUMP:

((B,1),{(43,B,1),(32,B,1)})
((B,2),{(77,B,2),(33,B,2)})
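
The same two groups could probably also be obtained by filtering before grouping, which usually means less data has to go through the GROUP step. This is only a sketch of that variant: the aliases onlyB and bgroup are made up for illustration, and the order of tuples inside each bag is not guaranteed to match the output above.

```pig
-- Same input file and schema as the script above.
data = LOAD 'SO/data.txt' USING PigStorage(' ') AS (val:int, id1:chararray, id2:int);

-- Keep only the rows whose id1 is 'B' before any grouping happens.
onlyB = FILTER data BY id1 == 'B';

-- Group the remaining rows by the composite key (id1, id2).
bgroup = GROUP onlyB BY (id1, id2);
DUMP bgroup;
```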

Is this what you wanted to do?
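
If the aim is instead to look up the reasonName for each changeReasonID that occurs in price, a JOIN may be a better fit than a nested FOREACH. As far as I can tell, the parse error in the question arises because a nested FOREACH block can only work on fields of the relation being iterated, so referring to the separate relation changeReason inside it is treated as a scalar expression. The following is a minimal sketch under that reading of the question. The aliases priceReasonIDs, distinctIDs, joined and result are hypothetical, and it assumes piggybank is already available to the script, as the question's own LOAD statements imply.

```pig
-- LOAD statements copied from the question.
changeReason = LOAD 'Change_Reason.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('|', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (changeReasonID:int, reasonName:chararray);
price = LOAD '$directory/Price.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('|', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (priceID:int, changeReasonID:int);

-- Collect the distinct reason IDs that actually appear in price.
priceReasonIDs = FOREACH price GENERATE changeReasonID;
distinctIDs = DISTINCT priceReasonIDs;

-- Inner join: keeps only the reasons whose ID is used by some price row.
joined = JOIN distinctIDs BY changeReasonID, changeReason BY changeReasonID;

-- Project the ID and its description, using the names from the question.
result = FOREACH joined GENERATE distinctIDs::changeReasonID AS changeID,
                                 changeReason::reasonName AS Reason;
DUMP result;
```

Because the join is an inner join, IDs in price that have no matching row in Change_Reason.txt would be dropped; a LEFT OUTER join on distinctIDs would keep them with a null Reason.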