如何写这个Pig查询?

时间:2012-11-08 20:05:23

标签: apache-pig

我在两个集合之间有一个多对多的映射表。映射表中的每一行都代表一个带有权重分数的可能映射。

mapping(id1, id2, weight)

查询:生成id1和id2之间的一对一映射。使用最低权重删除重复映射。如果有平局,则输出任意一个。

示例输入:

(1, X, 1)
(1, Y, 2)
(2, X, 3)
(2, Y, 1)
(3, Z, 2)

输出

(1, X)
(2, Y)
(3, Z)

1和2都映射到X和Y.我们选择映射(1,X)和(2,Y),因为它们的权重最低。

2 个答案:

答案 0 :(得分:2)

我将假设您只对权重是涉及id1的任何映射中最低的映射感兴趣,也是涉及id2的任何映射中最低的映射。例如,如果你另外有映射(2,Y,4),它就不会与(1,X,1)冲突。我将排除这样的映射,因为重量小于(1,Y,2)和(2,X,3),这是不合格的。

我的解决方案如下:找到每个id1的最小映射权重,然后将其加入映射关系以供将来参考。使用nested foreach遍历每个id2:使用ORDER和LIMIT选择id2最小权重的记录,然后仅在权重也是id1的最小值时保留它。

这是完整的脚本,在您的输入上进行了测试:

mapping = LOAD 'input' AS (id1:chararray, id2:chararray, weight:double);

id1_weights =
    FOREACH (GROUP mapping BY id1)
    GENERATE group AS id1, MIN(mapping.weight) AS id1_min_weight;
mapping_with_id1_mins =
    FOREACH (JOIN mapping BY id1, id1_weights BY id1)
    GENERATE mapping::id1, id2, weight, id1_min_weight;

accepted_mappings =
    FOREACH (GROUP mapping_with_id1_mins BY id2)
    {
        ordered = ORDER mapping_with_id1_mins BY weight;
        selected = LIMIT ordered 1;
        acceptable = FILTER selected BY weight == id1_min_weight;
        GENERATE FLATTEN(acceptable);
    };

DUMP accepted_mappings;

答案 1 :(得分:0)

使用Java UDF解决了这个问题。从某种意义上来说,它并不是完美的,它不会最大化一对一映射的数量,但这已经足够了。

猪:

d = load 'test' as (fid, iid, priority:double);
g = group d by fid;
o = foreach g generate FLATTEN(com.propeld.pig.DEDUP(d)) as (fid, iid, priority);
store o into 'output';

g2 = group o by iid;
o2 = foreach g2 generate FLATTEN(com.propeld.pig.DEDUP(o)) as (fid, iid, priority);
store o2 into 'output2';

Java UDF:

package com.propeld.pig;

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class DEDUP extends EvalFunc<Tuple> implements Algebraic{
    public String getInitial() {return Initial.class.getName();}
    public String getIntermed() {return Intermed.class.getName();}
    public String getFinal() {return Final.class.getName();}
    static public class Initial extends EvalFunc<Tuple> {
        private static TupleFactory tfact = TupleFactory.getInstance();
        public Tuple exec(Tuple input) throws IOException {
            // Initial is called in the map.
            // we just send the tuple down
            try {
                // input is a bag with one tuple containing
                // the column we are trying to operate on
                DataBag bg = (DataBag) input.get(0);
                if (bg.iterator().hasNext()) {
                    Tuple dba = (Tuple) bg.iterator().next();
                    return dba;
                } else {
                    // make sure that we call the object constructor, not the list constructor
                    return tfact.newTuple((Object) null);
                }
            } catch (ExecException e) {
                throw e;
            } catch (Exception e) {
                int errCode = 2106;
                throw new ExecException("Error executing an algebraic function", errCode, PigException.BUG, e);
            }
        }
    }
    static public class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return dedup(input);
        }
    }
    static public class Final extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {return dedup(input);}
    }

    static protected Tuple dedup(Tuple input) throws ExecException, NumberFormatException {
        DataBag values = (DataBag)input.get(0);
        Double min = Double.MAX_VALUE;
        Tuple result = null;
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            Tuple t = (Tuple) it.next();

            if ((Double)t.get(2) < min){
                min = (Double)t.get(2);
                result = t;
            }
        }
        return result;
    }

    @Override
    public Tuple exec(Tuple input) throws IOException {
        return dedup(input);
    }
}