I have a many-to-many mapping table between two sets. Each row in the mapping table represents a possible mapping, with a weight score.
mapping(id1, id2, weight)
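Query: generate a one-to-one mapping between id1 and id2. Use the lowest weight to remove duplicate mappings. If there is a tie, output any one of the tied mappings.
Example input: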
(1, X, 1)
(1, Y, 2)
(2, X, 3)
(2, Y, 1)
(3, Z, 2)
Output:
(1, X)
(2, Y)
(3, Z)
Both 1 and 2 map to both X and Y. We choose the mappings (1, X) and (2, Y) because they have the lowest weights.
Answer 0: (score: 2)
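I'm going to assume that you are only interested in mappings whose weight is the lowest of any mapping involving id1 and also the lowest of any mapping involving id2. For example, if you additionally had the mapping (2, Y, 4), it would not conflict with (1, X, 1). I would exclude a mapping like this too, for the same reason that (1, Y, 2) and (2, X, 3) are disqualified: its weight is not the lowest for its id1.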
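My solution is as follows: find the minimum mapping weight for each id1, then join it back onto the mapping relation for later reference. Use a nested foreach to go through each id2: use ORDER and LIMIT to select the record with the smallest weight for that id2, then keep it only if that weight is also the minimum for its id1.
Here is the full script, tested on your input: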
-- Load the candidate mappings.
mapping = LOAD 'input' AS (id1:chararray, id2:chararray, weight:double);

-- Lowest weight seen for each id1.
id1_weights =
    FOREACH (GROUP mapping BY id1)
    GENERATE group AS id1, MIN(mapping.weight) AS id1_min_weight;

-- Attach that minimum to every mapping row.
mapping_with_id1_mins =
    FOREACH (JOIN mapping BY id1, id1_weights BY id1)
    GENERATE mapping::id1, id2, weight, id1_min_weight;

-- For each id2, take its lowest-weight row and keep it only if
-- that weight is also the minimum for the row's id1.
accepted_mappings =
    FOREACH (GROUP mapping_with_id1_mins BY id2)
    {
        ordered = ORDER mapping_with_id1_mins BY weight;
        selected = LIMIT ordered 1;
        acceptable = FILTER selected BY weight == id1_min_weight;
        GENERATE FLATTEN(acceptable);
    };

DUMP accepted_mappings;
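To make the selection rule easier to follow outside Pig, here is a minimal sketch of the same logic in plain Java. The class `MutualMinimum`, the `Mapping` holder, and the `sample()` helper are names invented for this illustration; they are not part of the script above or of any Pig API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of the original answer.
class MutualMinimum {
    /** One candidate row of the mapping table. */
    static final class Mapping {
        final String id1;
        final String id2;
        final double weight;

        Mapping(String id1, String id2, double weight) {
            this.id1 = id1;
            this.id2 = id2;
            this.weight = weight;
        }

        @Override
        public String toString() {
            return "(" + id1 + ", " + id2 + ")";
        }
    }

    /** Keeps the lowest-weight row of each id2 if it is also the lowest for its id1. */
    static List<Mapping> select(List<Mapping> rows) {
        Map<String, Double> minById1 = new HashMap<>();
        Map<String, Mapping> bestById2 = new LinkedHashMap<>();
        for (Mapping m : rows) {
            // Minimum weight per id1 (the id1_weights relation in the script).
            minById1.merge(m.id1, m.weight, Double::min);
            // Lowest-weight row per id2 (ORDER + LIMIT 1); the first row wins ties.
            bestById2.merge(m.id2, m, (a, b) -> b.weight < a.weight ? b : a);
        }
        List<Mapping> kept = new ArrayList<>();
        for (Mapping best : bestById2.values()) {
            // The FILTER step: accept only if the weight is also id1's minimum.
            if (best.weight == minById1.get(best.id1)) {
                kept.add(best);
            }
        }
        return kept;
    }

    /** The example input from the question. */
    static List<Mapping> sample() {
        return Arrays.asList(
            new Mapping("1", "X", 1), new Mapping("1", "Y", 2),
            new Mapping("2", "X", 3), new Mapping("2", "Y", 1),
            new Mapping("3", "Z", 2));
    }

    public static void main(String[] args) {
        System.out.println(select(sample())); // prints [(1, X), (2, Y), (3, Z)]
    }
}
```

Like the script, this sketch picks one row per id2, so an id1 can still appear twice if two of its rows tie for its minimum weight under different id2 values.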
Answer 1: (score: 0)
I solved this using a Java UDF. It is not perfect, in the sense that it does not maximize the number of one-to-one mappings, but it is good enough.
Pig:
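d = load 'test' as (fid, iid, priority:double);
-- Pass 1: for each fid, keep the row with the smallest priority value.
g = group d by fid;
o = foreach g generate FLATTEN(com.propeld.pig.DEDUP(d)) as (fid, iid, priority);
store o into 'output';
-- Pass 2: among those rows, keep the row with the smallest priority value for each iid.
g2 = group o by iid;
o2 = foreach g2 generate FLATTEN(com.propeld.pig.DEDUP(o)) as (fid, iid, priority);
store o2 into 'output2';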
Java UDF:
package com.propeld.pig;

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class DEDUP extends EvalFunc<Tuple> implements Algebraic {
    public String getInitial() { return Initial.class.getName(); }
    public String getIntermed() { return Intermed.class.getName(); }
    public String getFinal() { return Final.class.getName(); }

    static public class Initial extends EvalFunc<Tuple> {
        private static TupleFactory tfact = TupleFactory.getInstance();

        public Tuple exec(Tuple input) throws IOException {
            // Initial is called in the map.
            // we just send the tuple down
            try {
                // input is a bag with one tuple containing
                // the column we are trying to operate on
                DataBag bg = (DataBag) input.get(0);
                Iterator<Tuple> it = bg.iterator();
                if (it.hasNext()) {
                    return it.next();
                } else {
                    // make sure that we call the object constructor, not the list constructor
                    return tfact.newTuple((Object) null);
                }
            } catch (ExecException e) {
                throw e;
            } catch (Exception e) {
                int errCode = 2106;
                throw new ExecException("Error executing an algebraic function", errCode, PigException.BUG, e);
            }
        }
    }

    static public class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return dedup(input);
        }
    }

    static public class Final extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return dedup(input);
        }
    }

    // Returns the tuple with the smallest value in field 2 (the weight/priority),
    // or null if the bag contains no usable tuple.
    static protected Tuple dedup(Tuple input) throws ExecException {
        DataBag values = (DataBag) input.get(0);
        double min = Double.MAX_VALUE;
        Tuple result = null;
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            Tuple t = it.next();
            // Skip the one-field placeholder that Initial emits for an empty bag,
            // and rows with a null weight, instead of failing on them.
            if (t == null || t.size() < 3 || t.get(2) == null) {
                continue;
            }
            double weight = (Double) t.get(2);
            if (weight < min) {
                min = weight;
                result = t;
            }
        }
        return result;
    }

    @Override
    public Tuple exec(Tuple input) throws IOException {
        return dedup(input);
    }
}
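To illustrate why the two-pass approach does not maximize the number of one-to-one mappings, here is a small plain-Java sketch of the same idea: keep the minimum row per fid, then the minimum row per iid. The class `TwoPassDedup`, the `Row` holder, and both sample data sets are hypothetical and exist only for this illustration; the second data set is invented, not taken from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration class, not part of the original answer.
class TwoPassDedup {
    static final class Row {
        final String fid;
        final String iid;
        final double priority;

        Row(String fid, String iid, double priority) {
            this.fid = fid;
            this.iid = iid;
            this.priority = priority;
        }

        @Override
        public String toString() {
            return "(" + fid + ", " + iid + ")";
        }
    }

    /** For each key, keeps the row with the smallest priority (first row wins ties). */
    static List<Row> minPerKey(List<Row> rows, Function<Row, String> key) {
        Map<String, Row> best = new LinkedHashMap<>();
        for (Row r : rows) {
            best.merge(key.apply(r), r, (a, b) -> b.priority < a.priority ? b : a);
        }
        return new ArrayList<>(best.values());
    }

    /** Pass 1 groups by fid, pass 2 groups the survivors by iid. */
    static List<Row> dedup(List<Row> rows) {
        return minPerKey(minPerKey(rows, r -> r.fid), r -> r.iid);
    }

    /** The example input from the question. */
    static List<Row> questionExample() {
        return Arrays.asList(
            new Row("1", "X", 1), new Row("1", "Y", 2),
            new Row("2", "X", 3), new Row("2", "Y", 1),
            new Row("3", "Z", 2));
    }

    /** Invented input where a larger one-to-one mapping exists than the one found. */
    static List<Row> lossyExample() {
        return Arrays.asList(
            new Row("1", "X", 1), new Row("2", "X", 2), new Row("2", "Y", 3));
    }

    public static void main(String[] args) {
        System.out.println(dedup(questionExample())); // prints [(1, X), (2, Y), (3, Z)]
        System.out.println(dedup(lossyExample()));    // prints [(1, X)]
    }
}
```

In the invented second input, pass 1 keeps (2, X) for fid 2 and discards (2, Y); pass 2 then drops (2, X) in favor of (1, X), so fid 2 ends up unmapped even though (1, X) and (2, Y) together would have been a valid one-to-one mapping.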