I'm new to Pig and I'm trying to create a UDF that takes a tuple and returns multiple tuples based on a delimiter. So I wrote a UDF to read the data file below:
2012/01/01 Name1 Category1|Category2|Category3
2012/01/01 Name2 Category2|Category3
2012/01/01 Name3 Category1|Category5
Basically I'm trying to read the $2 field:
Category1|Category2|Category3
Category2|Category3
Category1|Category5
and turn it into the following output:
Category1, Category2, Category3
Category2, Category3
Category1, Category5
Below is the UDF code I wrote:
package com.test.multipleTuple;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class TupleToMultipleTuple extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Keep the count of every cell in the
        Tuple aux = TupleFactory.getInstance().newTuple();
        if (input == null || input.size() == 0)
            return null;
        try {
            String del = "\\|";
            String str = (String) input.get(0);
            String[] field = str.split(del);
            for (String nxt : field) {
                aux.append(nxt.trim());
            }
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
        return aux.toDelimitedString(",");
    }
}
I created the jar -> TupleToMultipleTuple.jar
But I'm getting the following error when I run it.
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias B
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias B
at org.apache.pig.PigServer.openIterator(PigServer.java:892)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:774)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:547)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
at org.apache.pig.PigServer.openIterator(PigServer.java:884)
... 13 more
Please help me fix this. Thanks.
Pig script used to apply the UDF:
REGISTER TupleToMultipleTuple.jar;
DEFINE myFunc com.test.multipleTuple.TupleToMultipleTuple();
A = load 'data.txt' USING PigStorage(' ');
B = foreach A generate myFunc($2);
dump B;
Answer 0 (score: 1)
You can use the built-in STRSPLIT function:
flatten(STRSPLIT($2, '[|]', 3)) as (cat1:chararray, cat2:chararray, cat3:chararray)
You will get three fields named cat1, cat2, and cat3, each of type chararray, separated by the current delimiter of the relation they belong to.
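For example, applied to the script in the question, a minimal sketch might look like this (the field names dt, name, and cats are illustrative, not from the original post):

-- dt, name and cats are assumed names for the three whitespace-separated columns
A = LOAD 'data.txt' USING PigStorage(' ') AS (dt:chararray, name:chararray, cats:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(cats, '[|]', 3)) AS (cat1:chararray, cat2:chararray, cat3:chararray);
DUMP B;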
Answer 1 (score: 0)
Found the problem. It was in converting the DataByteArray to String; fixed it by using toString():
package com.test.multipleTuple;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class TupleToMultipleTuple extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Keep the count of every cell in the
        Tuple aux = TupleFactory.getInstance().newTuple();
        if (input == null || input.size() == 0)
            return null;
        try {
            String del = "\\|";
            String str = input.get(0).toString();
            String[] field = str.split(del);
            for (String nxt : field) {
                aux.append(nxt.trim());
            }
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
        return aux.toDelimitedString(",");
    }
}
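As a side note (my observation, not part of the original answer): when LOAD is used without a schema, PigStorage passes each field to the UDF as a bytearray (DataByteArray), which is why the plain (String) cast failed at runtime. A sketch of an alternative that avoids the issue by declaring a schema at load time, so the field already arrives as a chararray (the field names dt, name, and categories are illustrative):

REGISTER TupleToMultipleTuple.jar;
DEFINE myFunc com.test.multipleTuple.TupleToMultipleTuple();
-- declaring a schema makes the third column a chararray, so input.get(0) is already a String
A = LOAD 'data.txt' USING PigStorage(' ') AS (dt:chararray, name:chararray, categories:chararray);
B = FOREACH A GENERATE myFunc(categories);
DUMP B;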