Using an alias as a Pig UDF parameter

Asked: 2015-12-02 15:16:56

Tags: hadoop apache-pig

I need help understanding how to use an alias (a stored tuple) inside my Pig UDF. This is what I have done so far:

The file my_file.csv:

101,message here
102,message here
103,message here
...

My Pig script:

X = load 'mydata.csv' using PigStorage(',') as (myVar:chararray);
A = load 'my_file.csv' using PigStorage(',') as (key:chararray,value:chararray);
-- Collect every row of A into a single group
B = GROUP A ALL;
C = foreach B {
    -- Sort the grouped rows by key, then flatten the bag into one tuple
    D = ORDER A BY key;
    GENERATE BagToTuple(D);
};

The result of C is something like (101,message here,102,message here,103,message here,...)

What I need now is to pass this result to my UDF, like this:

Z = foreach X generate MYUDF(myVar, C);

The alias "C" is a tuple of the form key, value, key, value, ...

MYUDF:

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.PigWarning;
import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class ReDecode extends EvalFunc<String> {
    int numParams = -1;
    Pattern mPattern = null;
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(new Schema.FieldSchema(getSchemaName(this
                    .getClass().getName().toLowerCase(), input),
                    DataType.CHARARRAY));
        } catch (Exception e) {
            return null;
        }
    }

    @Override
    public String exec(Tuple tuple) throws IOException {
        if (numParams==-1)  // Not initialized
        {
            numParams = tuple.size();
            if (numParams <= 2) {
                String msg = "Decode: At least an expression and default string is required.";
                throw new IOException(msg);
            }
            if (tuple.size()%2!=0) {
                String msg = "ItssPigUDFs.ReDecode : Some parameters are unmatched.";
                throw new IOException(msg);
            }
        }

        if (tuple.get(0)==null)
            return null;

        try {
            for (int count = 1; count < numParams - 1; count += 2)
            {

                mPattern=Pattern.compile((String)tuple.get(count));
                if (mPattern.matcher((String)tuple.get(0)).matches())
                {
                    return (String)tuple.get(count+1);
                }
            }
        } catch (ClassCastException e) {
            warn("ItssPigUDFs.ReDecode : Data type error", PigWarning.UDF_WARNING_1);
            return null;
        } catch (NullPointerException e) {
            String msg = "ItssPigUDFs.ReDecode : Encounter null in the input";
            throw new IOException(msg);
        }

        return (String)tuple.get(tuple.size()-1);
    }
}

Thanks for your help.

1 answer:

Answer 0 (score: 0)

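I don't think numParams is needed; the number of arguments that reach your UDF is input.size(), where input is the Tuple passed to exec (named tuple in your code).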

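So if you call MYUDF(myVar, C), you should be able to read the values in Java with String myVar = (String) input.get(0) and Tuple param2 = (Tuple) input.get(1). The cast is required because get returns Object.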
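
To make that concrete, here is a minimal sketch of the lookup in plain Java. It assumes the second argument really does arrive as one flat tuple laid out key, value, key, value, as described in the question. PairDecoder and decode are hypothetical names, not part of Pig; the logic works on a List<Object> so it can be tried without a Pig runtime. Unlike the original ReDecode, it has no trailing default string, so it returns null when nothing matches.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Pig): the decode logic on a plain list,
// so it can be exercised without a Pig runtime.
final class PairDecoder {
    private PairDecoder() {}

    /**
     * @param value the first UDF argument (myVar in the question)
     * @param pairs flat list: key, value, key, value, ...
     * @return the value paired with the first key whose regex matches the
     *         whole input, or null when no key matches
     */
    static String decode(String value, List<Object> pairs) {
        if (value == null || pairs == null) {
            return null;
        }
        if (pairs.size() % 2 != 0) {
            throw new IllegalArgumentException("Unmatched key/value parameters");
        }
        // Walk the list two entries at a time: even index = key, odd = value.
        for (int i = 0; i + 1 < pairs.size(); i += 2) {
            Object key = pairs.get(i);
            Object mapped = pairs.get(i + 1);
            if (key == null) {
                continue; // skip null keys instead of failing
            }
            if (Pattern.matches(key.toString(), value)) {
                return mapped == null ? null : mapped.toString();
            }
        }
        return null;
    }
}
```

Inside exec, the call could then look like PairDecoder.decode((String) input.get(0), ((Tuple) input.get(1)).getAll()), since Tuple.getAll() returns the fields as a List<Object>. Because the size is read from the list on every call, no numParams field is needed.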