How to send variable-sized input from Pig Latin to a user-defined function

Asked: 2016-06-14 03:50:26

Tags: java hadoop apache-pig user-defined-functions

I am using a simple UDF in Pig Latin / MapReduce. The Pig Latin query is:

REGISTER \PigStringOperations.jar
sensitive = LOAD '/mdsba/sample2.csv' using PigStorage(',') as (AGE:int,EDU:chararray,SEX:chararray,SALARY:chararray);
BV= group  sensitive by (EDU,SEX) ; 
BVA= foreach BV generate sensitive.AGE as AGE;
anon = FOREACH BVA  GENERATE PigStringOperations.StringSplit(sensitive.AGE);
DUMP anon;

The UDF is a simple Java program, shown below:

public String exec(Tuple input) throws IOException
{
  String data = (String) input.get(0);
  if (data.contains(" "))
  {
    this.data2 = data.split(" ");
    return this.data2[0].toString();
  }
  return data;
}
}

The data is taken from the Adult database sample. After grouping by (EDU, SEX), the AGE output differs in size from one tuple to the next, as shown below:

AGE(12,10,35,20)
AGE(4,56,10)
AGE(70)

Every time I run the program, I get the following error:

ERROR 1066: Unable to open iterator for alias anon. Backend error : org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (,EDU,SEX,SALARY), 2nd :(39,Bachelors,Male,<=50K)

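1 Answer:

Answer 0 (score: 0)

After you group the data, sensitive.AGE becomes a bag! Take this into account when designing your UDF. It helps to run DESCRIBE on the projection, for example: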

DESCRIBE BVA;

This will help you understand the data structure and plan your processing accordingly.
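To illustrate what "plan for a bag" could look like, here is a minimal sketch of the per-group logic, written with plain Java collections so it runs without Pig on the classpath. It is an assumption-laden stand-in, not the asker's actual UDF: in a real UDF the argument would be an org.apache.pig.data.DataBag obtained from input.get(0), and you would iterate over its Tuples. The class name BagAgeSketch and the helpers tuple, firstToken and processBag are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for a bag-aware UDF. A List<List<Object>> plays the role of a
// Pig DataBag (a collection of Tuples); this is an assumption made so the
// sketch stays self-contained.
class BagAgeSketch {

    // Hypothetical helper: builds one "tuple" from its fields.
    static List<Object> tuple(Object... fields) {
        return Arrays.asList(fields);
    }

    // Same idea as the original exec(): keep the text before the first space.
    static String firstToken(Object value) {
        if (value == null) {
            return null;
        }
        String data = value.toString();
        int space = data.indexOf(' ');
        return space >= 0 ? data.substring(0, space) : data;
    }

    // Handles a bag of any size: one output string per tuple in the bag.
    static List<String> processBag(List<List<Object>> bag) {
        List<String> out = new ArrayList<>();
        if (bag == null) {
            return out;
        }
        for (List<Object> t : bag) {
            Object age = t.isEmpty() ? null : t.get(0);
            out.add(firstToken(age));
        }
        return out;
    }

    public static void main(String[] args) {
        // The three groups from the question, each with a different size.
        List<List<Object>> g1 = Arrays.asList(tuple(12), tuple(10), tuple(35), tuple(20));
        List<List<Object>> g2 = Arrays.asList(tuple(4), tuple(56), tuple(10));
        List<List<Object>> g3 = Arrays.asList(tuple(70));
        System.out.println(processBag(g1));
        System.out.println(processBag(g2));
        System.out.println(processBag(g3));
    }
}
```

The point of the sketch is that the function loops over whatever the group contains instead of casting the whole input to a single String, so groups of four, three or one element are all handled by the same code path.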