Question

I have a Pig script that looks like this:

scores = LOAD 'file' as (id:chararray, scoreid:chararray, score:int);
scoresGrouped = GROUP scores by id;
top10s = foreach scoresGrouped{
    sorted = order scores by score DESC;
    sorted10 = LIMIT sorted 10;
    GENERATE group as id, sorted10.scoreid as top10candidates;
};

It gives me a bag like this:

 id1, {(scoreidA),(scoreidB),(scoreIdC)..(scoreIdFoo)}

However, I would also like to include the item index, so that I get a result like this:

 id1, {(scoreidA,1),(scoreidB,2),(scoreIdC,3)..(scoreIdFoo,10)}

Is it possible to include the index somehow inside the nested foreach, or do I have to write my own UDF to add it afterwards?

Answer 1

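You'll need a UDF whose only argument is the sorted bag you want to add ranks to. I've had the same need before. Here is the exec function, to save you a little time: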

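public DataBag exec(Tuple b) throws IOException {
    // n (a long counter) and newBag (a DataBag) are class fields declared outside exec.
    // Reset both on every call so each input bag is ranked starting from 1 and the
    // output does not carry over tuples from a previous call.
    // (In an Accumulator implementation, this reset belongs in cleanup() instead.)
    n = 1L;
    newBag = BagFactory.getInstance().newDefaultBag();
    try {
        DataBag bag = (DataBag) b.get(0);
        Iterator<Tuple> it = bag.iterator();
        while (it.hasNext()) {
            Tuple t = it.next();
            // Append the rank only to non-empty tuples whose first field is not null.
            if (t != null && t.size() > 0 && t.get(0) != null) {
                t.append(n++);
            }
            newBag.add(t);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2106;
        String msg = "Error while computing item number in " + this.getClass().getSimpleName();
        throw new ExecException(msg, errCode, PigException.BUG, e);
    }

    return newBag;
}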

(The counter n is initialized as a class variable, outside the exec function.)

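You can also implement the Accumulator interface, which lets you do this even when the entire bag does not fit in memory. (The built-in COUNT function does this.) Just be sure to set n = 1L; in the cleanup() method and to return newBag; in getValue(); everything else is the same.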
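
For context, here is a rough sketch of how a UDF like this might be called from the script in the question. The jar name `myudfs.jar` and the class name `com.example.AddRank` are hypothetical placeholders, not part of the answer; substitute whatever you actually build. Passing `sorted10.scoreid` assumes you want only the score id plus its rank in each output tuple.

```pig
-- Hypothetical jar and class name: replace with your own build artifacts.
REGISTER 'myudfs.jar';
DEFINE AddRank com.example.AddRank();

scores = LOAD 'file' AS (id:chararray, scoreid:chararray, score:int);
scoresGrouped = GROUP scores BY id;
top10s = FOREACH scoresGrouped {
    sorted = ORDER scores BY score DESC;
    sorted10 = LIMIT sorted 10;
    -- The UDF receives the sorted, limited bag and appends a rank to each tuple.
    GENERATE group AS id, AddRank(sorted10.scoreid) AS top10candidates;
};
```

Because the UDF is applied inside the nested block, the rank is added in the same step that builds the top-10 bag, so no separate pass over the data should be needed.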

Answer 2

To index the elements of a bag, you can use the Enumerate UDF from LinkedIn's DataFu project:

register '/path_to_jar/datafu-0.0.4.jar';
define Enumerate datafu.pig.bags.Enumerate('1');
scores = ...
...
result = foreach top10s generate id, Enumerate(top10candidates);

Pig: Getting the index in a nested foreach
