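I'm using Pig to parse my application logs to find out which exposed methods a user has called since last month that the same user had not called before.
I have managed to get the methods grouped by user, before last month and after last month:
BEFORE last month relation sample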
u1 {(m1),(m2)}
u2 {(m3),(m4)}
AFTER last month relation sample
u1 {(m1),(m3)}
u2 {(m1),(m4)}
What I want is to find, for each user, which methods in AFTER are not in BEFORE, i.e.
NEWLY_CALLED expected result
u1 {(m3)}
u2 {(m1)}
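Question: how can I do this in Pig? Is it possible to subtract bags?
I have tried the DIFF function, but it does not perform the expected subtraction.
Best regards,
Joel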
Answer 0: (score: 2)
I think you need to write a UDF, in which you can then use:
Set<T> setA ...
Set<T> setB ...
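Set<T> setAminusB = new HashSet<T>(setA); // copy, so setA is left untouched
setAminusB.removeAll(setB); // java.util.Set has no subtract(); removeAll() does the subtraction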
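To make the idea concrete, here is a minimal plain-Java sketch of the per-user subtraction, run on the sample data from the question. It does not use the Pig API at all: the class name `NewlyCalled`, the method `newlyCalled`, and the use of maps of sets to stand in for the BEFORE and AFTER relations are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NewlyCalled {

    // Hypothetical helper: for each user in `after`, keep only the methods
    // that do not appear for the same user in `before`.
    static Map<String, Set<String>> newlyCalled(Map<String, Set<String>> before,
                                                Map<String, Set<String>> after) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : after.entrySet()) {
            // Copy the AFTER methods so the input map is not modified.
            Set<String> diff = new TreeSet<>(entry.getValue());
            Set<String> old = before.get(entry.getKey());
            if (old != null) {
                diff.removeAll(old); // set subtraction: AFTER minus BEFORE
            }
            result.put(entry.getKey(), diff);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> before = new LinkedHashMap<>();
        before.put("u1", new HashSet<>(Arrays.asList("m1", "m2")));
        before.put("u2", new HashSet<>(Arrays.asList("m3", "m4")));

        Map<String, Set<String>> after = new LinkedHashMap<>();
        after.put("u1", new HashSet<>(Arrays.asList("m1", "m3")));
        after.put("u2", new HashSet<>(Arrays.asList("m1", "m4")));

        System.out.println(newlyCalled(before, after)); // prints {u1=[m3], u2=[m1]}
    }
}
```

A user who has no entry in `before` keeps all of their AFTER methods in this sketch; whether that is the desired behaviour depends on how the two relations are joined in the actual script.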
Answer 1: (score: 2)
For anyone who might be interested, here is the Subtract function class I wrote and submitted to Pig (PIG-2881):
import static java.lang.String.format;

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

/**
 * Subtract takes two bags as arguments and returns a new bag composed of the tuples of the first bag that are not in the second bag.<br>
 * Null bag arguments are replaced by empty bags.
 * <p>
 * The implementation assumes that both bags being passed to this function will fit entirely into memory simultaneously.
 * <br>
 * If that is not the case the UDF will still function, but it will be <strong>very</strong> slow.
 */
public class Subtract extends EvalFunc<DataBag> {

    /**
     * Compares the two bag fields from the input Tuple and returns a new bag composed of the elements of the first bag that are not in the second bag.
     * @param input a tuple with exactly two bag fields.
     * @throws IOException if there are not exactly two fields in the tuple or if they are not {@link DataBag}.
     */
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input.size() != 2) {
            throw new ExecException("Subtract expected two inputs but received " + input.size() + " inputs.");
        }
        DataBag bag1 = toDataBag(input.get(0));
        DataBag bag2 = toDataBag(input.get(1));
        return subtract(bag1, bag2);
    }

    private static String classNameOf(Object o) {
        return o == null ? "null" : o.getClass().getSimpleName();
    }

    // A null field is treated as an empty bag; anything that is not a DataBag is rejected.
    private static DataBag toDataBag(Object o) throws ExecException {
        if (o == null) {
            return BagFactory.getInstance().newDefaultBag();
        }
        if (o instanceof DataBag) {
            return (DataBag) o;
        }
        throw new ExecException(format("Expecting input to be DataBag only but was '%s'", classNameOf(o)));
    }

    private static DataBag subtract(DataBag bag1, DataBag bag2) {
        DataBag subtractBag2FromBag1 = BagFactory.getInstance().newDefaultBag();
        // convert bag1 to a Set; this assumes the set will fit in memory
        // (duplicate tuples in bag1 are collapsed by the Set).
        Set<Tuple> set1 = toSet(bag1);
        // remove elements of bag2 from set1
        Iterator<Tuple> bag2Iterator = bag2.iterator();
        while (bag2Iterator.hasNext()) {
            set1.remove(bag2Iterator.next());
        }
        // set1 now contains all elements of bag1 not in bag2 => we can build the resulting DataBag.
        for (Tuple tuple : set1) {
            subtractBag2FromBag1.add(tuple);
        }
        return subtractBag2FromBag1;
    }

    private static Set<Tuple> toSet(DataBag bag) {
        Set<Tuple> set = new HashSet<Tuple>();
        Iterator<Tuple> iterator = bag.iterator();
        while (iterator.hasNext()) {
            set.add(iterator.next());
        }
        return set;
    }
}
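One consequence of going through a `HashSet` is that the result has set semantics: a tuple that occurs several times in the first bag appears only once in the output. The plain-Java sketch below mirrors the logic of `subtract` so it can be run without Pig. It is an illustration under assumptions, not the UDF itself: `List<String>` stands in for `DataBag`, `SubtractSketch` is a made-up class name, and a `LinkedHashSet` is used instead of `HashSet` only to keep the output order predictable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SubtractSketch {

    // Same steps as the UDF: load bag1 into a set, remove every element
    // of bag2 from it, then copy what is left into the result.
    static List<String> subtract(List<String> bag1, List<String> bag2) {
        Set<String> set1 = new LinkedHashSet<>(bag1); // duplicates in bag1 collapse here
        for (String element : bag2) {
            set1.remove(element);
        }
        return new ArrayList<>(set1);
    }

    public static void main(String[] args) {
        List<String> bag1 = Arrays.asList("m1", "m3", "m3"); // m3 appears twice
        List<String> bag2 = Arrays.asList("m1");
        System.out.println(subtract(bag1, bag2)); // prints [m3]
    }
}
```

If duplicate counts matter, a multiset-style subtraction would be needed instead; for the "newly called methods" use case in the question, set semantics are sufficient.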