Passing a bag as input to a UDF in Pig

Asked: 2017-02-16 13:46:00

Tags: hadoop mapreduce apache-pig

I am trying to pass a DataBag (the relation final) as input to a UDF.

 dump final;

gives:

(4,john,john,David,Banking ,4,M,20-01-1994,78.65,345000,Arkansasdest1,Destination)
(4,john,john,David,Banking ,4,M,20-01-1994,78.65,345000,Arkanssdest2,Destination)
(4,johns,johns,David,Banking ,4,M,20-01-1994,78.65,345000,ArkansasSrc1,source)
(4,johns,johns,David,Banking ,4,M,20-01-1994,78.65,345000,ArkansaSrc2,source)

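I am about to write a UDF that processes the DataBag above and finds the mismatches between Source and Destination. Before doing that, I have to check that my UDF accepts a DataBag at all, so I wrote the sample UDF below: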

package PigUDFpck;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class databag extends EvalFunc<DataBag> {
    private final TupleFactory mTupleFactory = TupleFactory.getInstance();
    private final BagFactory mBagFactory = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        // The first argument is expected to be a bag of tuples.
        DataBag values = (DataBag) input.get(0);
        DataBag result = mBagFactory.newDefaultBag();
        for (Tuple tuple : values) {
            // Placeholder logic: wrap each input tuple in a new tuple.
            Tuple t = mTupleFactory.newTuple();
            t.append(tuple);
            result.add(t);
        }
        return result;
    }
}

After that, I registered the JAR and defined the function:

REGISTER /usr/local/pig/UDF/UDFBAG.jar;
DEFINE Databag Databag(); // not sure how to define it 

2017-02-16 19:07:05,875 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s). // I get this warning after the DEFINE.

final1 = FOREACH final GENERATE(Databag(final));

ERROR 1200: Pig script failed to parse: Invalid scalar projection: final : A column needs to be projected from a relation for it to be used as a scalar

Please help me define the UDF and show me how to pass a DataBag to it.

Thanks

1 Answer:

Answer 0: (score: 1)

Try

final1 = FOREACH final GENERATE(Databag(*));

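Although, as far as I can tell, your final contains tuples, not a bag of tuples, so you will probably need to group it by some key first. In that case it would look like
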
final1 = FOREACH (group final [by key or all]) GENERATE(Databag(final));
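
To make the grouped form concrete, here is a minimal sketch of the whole script. It assumes the JAR really contains the class exactly as shown in the question (package PigUDFpck, class databag), so the DEFINE uses that fully qualified, case-sensitive name; the alias names grouped and final1 are only illustrative.

```pig
REGISTER /usr/local/pig/UDF/UDFBAG.jar;

-- Bind a short alias to the fully qualified class name
-- (assumed: package PigUDFpck, class databag, as in the question).
DEFINE Databag PigUDFpck.databag();

-- GROUP ... ALL produces a single tuple (group, final) whose second
-- field is a bag holding every tuple of the relation final.
grouped = GROUP final ALL;

-- Inside this FOREACH, final refers to that bag field, so it can be
-- passed to the UDF as a DataBag.
final1 = FOREACH grouped GENERATE Databag(final);

DUMP final1;
```

GROUP ... ALL funnels every record into one group, which is fine for a small test but does not scale; to compare source and destination rows per record, grouping by a key column (for example GROUP final BY $0) may be the better fit. Note also that with Databag(*) on the ungrouped relation, input.get(0) in exec would be the first column of each row rather than a bag, so the cast to DataBag would most likely fail.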