Trying in Pig

Time: 2017-01-15 17:25:04

Tags: hadoop apache-pig hdfs pig-udf

My Python UDF code:

# commaFormat - format a number with commas, 12345 -> 12,345
@outputSchema("numformat:chararray")
def commaFormat(num):
    return '{:,}'.format(num)

My Pig script:

DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage;
A = LOAD '/result.csv' using CSVExcelStorage() As (id:int,lastvisitedtime:chararray,title:chararray,typedcount:int,URL:chararray,visitcount:int,bytes:int);
B = limit A 15;
REGISTER '/data/pyudf/test.py' USING streaming_python AS myudfs;
C = FOREACH B generate myudfs.commaFormat($1);

Pig Stack Trace:

  

---------------
ERROR 1002: Unable to store alias C

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C
    at org.apache.pig.PigServer.openIterator(PigServer.java:1019)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:747)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
    at org.apache.pig.Main.run(Main.java:630)
    at org.apache.pig.Main.main(Main.java:176)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias C
    at org.apache.pig.PigServer.storeEx(PigServer.java:1122)
    at org.apache.pig.PigServer.store(PigServer.java:1081)
    at org.apache.pig.PigServer.openIterator(PigServer.java:994)
    ... 13 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: C: Store(hdfs://localhost:54310/tmp/temp1063554930/tmp-651585063:org.apache.pig.impl.io.InterStorage) - scope-16 Operator Key: scope-16): org.apache.pig.impl.streaming.StreamingUDFException: LINE: KeyError: 'concatMult4'
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNextTuple(POStore.java:159)
    at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.runPipeline(FetchLauncher.java:157)
    at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.launchPig(FetchLauncher.java:81)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:306)
    at org.apache.pig.PigServer.launchPlan(PigServer.java:1474)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1459)
    at org.apache.pig.PigServer.storeEx(PigServer.java:1118)
    ... 15 more
Caused by: org.apache.pig.impl.streaming.StreamingUDFException: LINE: KeyError: 'concatMult4'
    at org.apache.pig.impl.builtin.StreamingUDF$ProcessErrorThread.run(StreamingUDF.java:503)

2 answers:

Answer 0 (score: 0)

First, you are missing the () in the DEFINE statement.

REGISTER /path/piggybank.jar;
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();

You are probably using Mortar's CPython distribution, which requires at least Pig 0.12. Try the jython script engine instead.

REGISTER '/data/pyudf/test.py' USING jython AS myudfs;
C = FOREACH B generate myudfs.commaFormat($1);
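Whichever engine is used, it can help to check the UDF file on its own before registering it. The following is a minimal sketch of what /data/pyudf/test.py might look like, not the asker's actual file. It assumes the explicit `from pig_util import outputSchema` import that the Pig documentation describes for streaming_python scripts; the fallback decorator and the `int()` coercion are additions made here for illustration, so the file can also be imported outside Pig.

```python
try:
    # Supplied by Pig when the script runs under streaming_python.
    from pig_util import outputSchema
except ImportError:
    # Hypothetical fallback for local testing only: define a stand-in
    # decorator unless the engine has already provided one.
    if 'outputSchema' not in globals():
        def outputSchema(schema):
            def wrap(func):
                func.outputSchema = schema
                return func
            return wrap


@outputSchema("numformat:chararray")
def commaFormat(num):
    """Format an integer with thousands separators, e.g. 12345 -> '12,345'."""
    if num is None:
        return None
    # int() is an assumption: it lets numeric strings through and raises
    # ValueError for non-numeric input instead of failing inside format().
    return '{:,}'.format(int(num))
```

In the script above, `$1` is `lastvisitedtime`, which the LOAD schema declares as a chararray. A non-numeric string cannot be formatted with `'{:,}'`, so a numeric field such as `visitcount` or `bytes` is probably the intended argument.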

Alternatively, instead of writing a UDF, you can easily remove the commas with the REPLACE function.

REGISTER /path/piggybank.jar;
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();
A = LOAD '/result.csv' using CSVExcelStorage() AS (id:int,lastvisitedtime:chararray,title:chararray,typedcount:int,URL:chararray,visitcount:int,bytes:int);
B = FOREACH A GENERATE id,REPLACE(lastvisitedtime,',',''),title,typedcount,URL,visitcount,bytes;
C = LIMIT B 15;
DUMP C;
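For comparison, here is a rough Python equivalent of what `REPLACE(lastvisitedtime,',','')` does to a single value. The helper name `strip_commas` is hypothetical. Pig's REPLACE treats its second argument as a regular expression, which `re.sub` mirrors here. Note that this removes commas, which is the opposite direction from `commaFormat` in the question.

```python
import re


def strip_commas(value):
    """Remove every comma from a string, like REPLACE(value, ',', '')."""
    return re.sub(",", "", value)


print(strip_commas("12,345"))
```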

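Answer 1 (score: 0)

Pig does not handle Python UDFs that depend on other modules. You therefore need to package them in a JAR and register that file as part of the Pig script.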

REGISTER '/data/pyudf/test.py' USING jython AS myudfs;

Python UDFs explained