Pig: Filtering out the last tuple in a relation

Asked: 2016-04-20 19:21:12

Tags: hadoop apache-pig

I have the following data in HDFS, and I want to remove the last row.

/user/cloudera/test/testfile.csv

Day,TimeCST,Conditions
1,12:53 AM,Clear
1,1:53 AM,Clear
1,2:53 AM,Clear
1,3:53 AM,Clear
1,4:53 AM,Clear
1,5:53 AM,Clear
1,6:53 AM,Clear
1,7:53 AM,Clear
1,8:53 AM,Clear
1,9:53 AM,Clear
1,10:53 AM,Clear
1,11:53 AM,Clear
1,12:53 PM,Clear
1,1:53 PM,Clear
1,2:53 PM,Clear
1,3:53 PM,Clear
1,4:53 PM,Clear
1,5:53 PM,Clear

First, I load the data, filter out the header, and get the row/tuple count:

rawdata = LOAD 'hdfs:/user/cloudera/test/testfile.csv' using PigStorage(',') AS (day:int, timecst:chararray, condition:chararray);
filtereddata = FILTER rawdata BY day > 0; -- drops the header: its Day value fails the int cast and loads as null, and null > 0 is false
rowcount = FOREACH (GROUP filtereddata ALL) GENERATE COUNT_STAR(filtereddata);
dump rowcount; --Prints (18)

Next, I rank the data and then try to filter out the last row/tuple using the generated row number:

ranked = RANK filtereddata;
weatherdata = FILTER ranked BY $0 != rowcount.$0;

The filter above fails with the following error:

ERROR 2017: Internal error creating job configuration.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias weatherdata.....

However, if I hard-code the row count into my script as shown below, the job runs fine:

weatherdata = FILTER ranked BY $0 != 18;

I would like to avoid hard-coding the row count. Can you see where I might be going wrong? Thanks.

Apache Pig version 0.12.0-cdh5.5.0 (rexported), compiled Nov 09 2015, 12:41:48

2 Answers:

Answer 0 (score: 0):

You may have to cast it:

weatherdata = FILTER ranked BY $0 != (int)rowcount.$0;

Answer 1 (score: 0):

A combination of casting the count and giving it a named alias seems to do the trick. The following works:

rawdata = LOAD 'hdfs:/home/hduser/test/testfile.csv' using PigStorage(',') AS (day:int, timecst:chararray, condition:chararray);
filtereddata = FILTER rawdata BY day > 0; --filters out header
rowcount = FOREACH (GROUP filtereddata ALL) GENERATE COUNT_STAR(filtereddata) AS mycount:long;
ranked = RANK filtereddata;
weatherdata = FILTER ranked BY $0 != rowcount.mycount;
dump weatherdata;
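
As a possible follow-up (a sketch, not part of the original answer): RANK prepends a rank column to the relation, so if you want the output to match the shape of the input again, you can project that column away. The alias weatherdata_clean below is illustrative.

-- hedged sketch: drop the rank column that RANK prepended, keeping only
-- the original fields from the LOAD schema
weatherdata_clean = FOREACH weatherdata GENERATE day, timecst, condition;
dump weatherdata_clean;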