Pig - Passing a DataBag to a UDF constructor

Time: 2013-12-19 12:58:43

Tags: apache-pig user-defined-functions

I have a script that loads venue data:

venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);

Then I want to create a UDF whose constructor accepts the venues.

So I tried to define the UDF like this:

DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);

Here is the actual UDF:

public class GenerateVenues extends EvalFunc<Tuple> {

    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    private static final String ALLCHARS = "(.*)";
    private ArrayList<String> venues;

    private String regex;

    public GenerateVenues(DataBag venuesBag) {
        Iterator<Tuple> it = venuesBag.iterator();
        venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
        String current = "";
        regex = "";
        while (it.hasNext()){
            Tuple t = it.next();
            try {
                current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
                venues.add((String) t.get(0));
            } catch (ExecException e) {
                throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
            }
            regex += current + (it.hasNext() ? "|" : "");
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        // expect one string
        if (tuple == null || tuple.size() != 2) {
            throw new IllegalArgumentException(
                    "BagTupleExampleUDF: requires two input parameters.");
        }
        try {
            String tweet = (String) tuple.get(0);
            for (String venue: venues)
            {
                if (tweet.matches(ALLCHARS + venue + ALLCHARS))
                {
                    Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
                    return output;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IOException(
                    "BagTupleExampleUDF: caught exception processing input.", e);
        }
    }
}

When executed, the script fails at the DEFINE statement, at the (venues); part:

2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60>  mismatched input 'venues' expecting RIGHT_PAREN

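Obviously I'm doing something wrong; could you help me figure out what it is? Is it that a UDF cannot accept the venues relation as a parameter, or that the relation is not represented by a DataBag, as in public GenerateVenues(DataBag venuesBag)? Thanks!

PS: I'm using Pig version 0.11.1.1.3.0.0-107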

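2 Answers:

Answer 0: (score: 4)

As @WinnieNicklaus said, you can only pass strings to UDF constructors.

Having said that, the solution to your problem is to use the distributed cache: override public List<String> getCacheFiles() to return a list of filenames that will be made available through the distributed cache. With that in place, you can read the file as a local file and build your table.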

The downside is that Pig has no initialization function, so you have to implement something like

private void init() {
    if (!this.initialized) {
        // read table
        this.initialized = true; // make sure the table is only read once
    }
}

and then call it as the first thing in exec.
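
To make the pattern concrete, here is a minimal sketch in plain Java, written so that it compiles without Pig on the classpath. The class name VenueLookup, its two constructor arguments, and the method firstVenueIn are hypothetical stand-ins; in a real UDF the class would extend EvalFunc<Tuple>, getCacheFiles() would carry @Override, and the lookup would live in exec. The sketch assumes a venues file with one name per line, and it uses a plain substring check instead of the regex matching in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of the lazy-init pattern; not tied to the Pig API.
class VenueLookup {
    private final String hdfsPath;   // hypothetical: location of the venues file in HDFS
    private final String localName;  // hypothetical: name the file gets in the task's working directory
    private final List<String> venues = new ArrayList<String>();
    private boolean initialized = false;

    // Both arguments are strings, which is all a DEFINE statement can pass.
    VenueLookup(String hdfsPath, String localName) {
        this.hdfsPath = hdfsPath;
        this.localName = localName;
    }

    // In a real UDF this overrides EvalFunc.getCacheFiles();
    // the "path#name" form requests a local symlink called name.
    public List<String> getCacheFiles() {
        return Collections.singletonList(hdfsPath + "#" + localName);
    }

    // Reads the table once, on first use, because there is no init hook.
    private void init() throws IOException {
        if (initialized) {
            return;
        }
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(localName), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty()) {
                    venues.add(name);
                }
            }
        }
        initialized = true;
    }

    // Stands in for exec(Tuple): returns the first venue mentioned in the tweet, or null.
    public String firstVenueIn(String tweet) throws IOException {
        init(); // first thing, as described above
        if (tweet == null) {
            return null;
        }
        for (String venue : venues) {
            if (tweet.contains(venue)) {
                return venue;
            }
        }
        return null;
    }
}
```

With this layout the file path reaches the UDF as an ordinary string argument, and the venues themselves never have to pass through the constructor.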

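Answer 1: (score: 0)

You cannot use a relation as a parameter in a UDF constructor. Only strings can be passed as arguments, and if they actually represent another type, you have to parse them in the constructor.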
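
As an illustration of parsing string arguments, here is a small plain-Java sketch. The class name VenueListArg, the comma-separated format, and the second limit argument are all assumptions made for the example. On the Pig side, a matching definition would pass quoted literals, for example DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues('Hydro,Barrowland', '1'); which only makes sense when the list is short enough to write into the script.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder showing how string constructor arguments can be parsed.
class VenueListArg {
    private final List<String> venues = new ArrayList<String>();
    private final int limit;

    // Both values arrive as strings and are converted here.
    VenueListArg(String commaSeparatedVenues, String limitText) {
        for (String part : commaSeparatedVenues.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                venues.add(name);
            }
        }
        // Throws NumberFormatException if the script passed a non-numeric value.
        this.limit = Integer.parseInt(limitText.trim());
    }

    List<String> venues() {
        return venues;
    }

    int limit() {
        return limit;
    }
}
```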