Filtering inside a FOREACH block in Pig by a conditional value computed in the same block

Date: 2014-03-05 11:14:20

Tags: foreach apache-pig conditional-statements

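I have a log dataset, and for each piece of equipment I need to filter out all log entries that come after its first failure (Action = 2).

In this example: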

EquipId, SvcId, Action, TimeStamp
Ag,01,1,14-01-01 0:00:01
Ag,01,1,14-01-02 0:00:01
Ag,01,2,14-01-03 0:00:01
Ag,01,1,14-01-04 0:00:01
Ag,01,1,14-01-05 0:00:01
Ag,01,2,14-01-06 0:00:01
Ag,01,1,14-01-07 0:00:01
Ra,01,1,14-01-01 0:00:01
Ra,01,1,14-01-02 0:00:01
Ra,01,1,14-01-03 0:00:01
Ra,01,2,14-01-04 0:00:01
Fe,01,2,14-01-03 0:00:01
Fe,01,1,14-01-03 0:00:02
Fe,01,1,14-01-04 0:00:01
Lu,01,1,14-01-05 0:00:01
Lu,01,1,14-01-04 0:00:01
Lu,01,1,14-01-05 0:00:01

The expected output is:

Ag,01,1,14-01-01 0:00:01
Ag,01,1,14-01-02 0:00:01
Ag,01,2,14-01-03 0:00:01
Ra,01,1,14-01-01 0:00:01
Ra,01,1,14-01-02 0:00:01
Ra,01,1,14-01-03 0:00:01
Ra,01,2,14-01-04 0:00:01
Fe,01,2,14-01-03 0:00:01
Lu,01,1,14-01-05 0:00:01
Lu,01,1,14-01-04 0:00:01
Lu,01,1,14-01-05 0:00:01

I tried to do this inside a FOREACH block, as follows:

rawData = LOAD './test.csv'  USING PigStorage(',') AS (equipId:chararray, svcId:chararray, action:chararray, date:chararray);

equipDataGrp = GROUP rawData BY equipId;

minFail = FOREACH equipDataGrp {

    actionFail = FILTER rawData BY action == '2';
    minFailDate = MIN(actionFail.date);
    prevActionsFail = FILTER rawData BY date <= minFailDate;


    GENERATE group as equipId, FLATTEN(prevActionsFail.date);

};

I get the following error:

2014-03-05 11:08:11,720 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: 
<line 36, column 28> Invalid field reference. Referenced field [date] does not exist in schema: .

If I hard-code the date instead:

minFail = FOREACH equipDataGrp {

    actionFail = FILTER rawData BY action == '2';
    minFailDate = MIN(actionFail.date);
    prevActionsFail = FILTER rawData BY date == '14-01-03 0:00:01';


    GENERATE group as equipId, FLATTEN(prevActionsFail.date);

};

I get this result:

(Ag,14-01-03 0:00:01)
(Fe,14-01-03 0:00:01)
(Ra,14-01-03 0:00:01)

Any suggestions?

Thanks in advance!

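1 Answer:

Answer 0: (score: 5)

You need to compute the earliest failure time for each equipment ID and attach it to every record for that equipment. You can then filter out the records whose timestamp is later than that failure time: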

rawData = LOAD './test.csv'  USING PigStorage(',') AS (equipId:chararray, svcId:chararray, action:chararray, date:chararray);

equipDataGrp = GROUP rawData BY equipId;

/* Expand out into all records again, appending the earliest failure time */
minFail = FOREACH equipDataGrp {
    actionFail = FILTER rawData BY action == '2';
    GENERATE FLATTEN(rawData), MIN(actionFail.date) AS failTime;
};

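/* MIN over an empty bag is null, so equipment that never failed (Lu) has a null failTime.
   Keep those rows explicitly; a comparison against null would otherwise drop them. */
notYetFailed = FOREACH (FILTER minFail BY failTime IS NULL OR date <= failTime) GENERATE equipId .. date;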
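
To sanity-check the logic outside Pig, here is a small Python sketch of the same two steps (earliest failure per equipment, then filter) run on the sample data from the question. It is only a model of the approach, not Pig code. The names `SAMPLE`, `ROWS` and `not_yet_failed` are made up for illustration. It compares timestamps as plain strings, as the chararray comparison does, and that is only safe here because every timestamp in the sample has the same layout. It also assumes equipment with no failure at all (Lu) should keep every row, as the expected output shows.

```python
# Sample rows from the question, one "equipId,svcId,action,date" record per line.
SAMPLE = """\
Ag,01,1,14-01-01 0:00:01
Ag,01,1,14-01-02 0:00:01
Ag,01,2,14-01-03 0:00:01
Ag,01,1,14-01-04 0:00:01
Ag,01,1,14-01-05 0:00:01
Ag,01,2,14-01-06 0:00:01
Ag,01,1,14-01-07 0:00:01
Ra,01,1,14-01-01 0:00:01
Ra,01,1,14-01-02 0:00:01
Ra,01,1,14-01-03 0:00:01
Ra,01,2,14-01-04 0:00:01
Fe,01,2,14-01-03 0:00:01
Fe,01,1,14-01-03 0:00:02
Fe,01,1,14-01-04 0:00:01
Lu,01,1,14-01-05 0:00:01
Lu,01,1,14-01-04 0:00:01
Lu,01,1,14-01-05 0:00:01
"""

ROWS = [tuple(line.split(",")) for line in SAMPLE.splitlines()]


def not_yet_failed(rows):
    """Keep each equipment's rows up to and including its earliest failure."""
    # Step 1 (GROUP BY equipId, then MIN over action == '2'):
    # earliest failure date per equipment, absent if it never failed.
    fail_time = {}
    for equip_id, _svc_id, action, date in rows:
        if action == "2" and (equip_id not in fail_time or date < fail_time[equip_id]):
            fail_time[equip_id] = date

    # Step 2 (FILTER): keep rows dated no later than the failure time.
    # Equipment without any failure keeps all of its rows.
    return [
        row for row in rows
        if row[0] not in fail_time or row[3] <= fail_time[row[0]]
    ]


if __name__ == "__main__":
    for row in not_yet_failed(ROWS):
        print(",".join(row))
```

Running it prints the 11 rows listed as the expected output in the question, in the same order as the input.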