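I have account ids, each with a timestamp, grouped by username. For each of these username groups I want all pairs (oldest account, other account).
I have a Java reducer that does this. Can it be rewritten as a simple Pig script?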
Schema:
{group: (username), A: {(id, create_dt)}}
Input:
(batman,{(id1,100), (id2,200), (id3,50)})
(lulu ,{(id7,100), (id9,50)})
Desired output:
(batman,{(id3,id1), (id3,id2)})
(lulu ,{(id9,id7)})
Answer 0 (score: 1)
Not that anyone seems to care, but here goes. You have to create a UDF:
desired = foreach my_input generate group as n, FIND_PAIRS(A) as pairs_bag;
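And the UDF (register its jar and alias it, for example with DEFINE FIND_PAIRS FindPairs(), so that the name used in the script resolves):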
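import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

public class FindPairs extends EvalFunc<DataBag> {

    @Override
    public DataBag exec(Tuple input) throws IOException {
        Long pivotCreatedDate = Long.MAX_VALUE;
        Long pivot = null;
        DataBag accountsBag = (DataBag) input.get(0);
        // First pass: find the account with the minimal create_dt.
        for (Tuple account : accountsBag) {
            // Assumes numeric ids. Field 1 is create_dt in the (id, create_dt) schema above.
            Long accountId = Long.parseLong(account.get(0).toString());
            Long creationDate = Long.parseLong(account.get(1).toString());
            if (creationDate < pivotCreatedDate) {
                // pivot is the one with the minimal create_dt
                pivot = accountId;
                pivotCreatedDate = creationDate;
            }
        }
        // Second pass: pair the pivot with every other account in the bag.
        DataBag allPairs = BagFactory.getInstance().newDefaultBag();
        if (pivot != null) {
            for (Tuple account : accountsBag) {
                Long accountId = Long.parseLong(account.get(0).toString());
                if (!accountId.equals(pivot)) {
                    // we don't want any self-pairs
                    Tuple output = TupleFactory.getInstance().newTuple(2);
                    // Note: each pair is ordered by id (smaller id first).
                    // To get (oldest, other) as in the question, always put pivot first.
                    if (pivot < accountId) {
                        output.set(0, pivot.toString());
                        output.set(1, accountId.toString());
                    } else {
                        output.set(0, accountId.toString());
                        output.set(1, pivot.toString());
                    }
                    allPairs.add(output);
                }
            }
        }
        return allPairs;
    }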
If you want to play nice, add:
    /**
     * Letting pig know that we emit a bag with tuples, each representing a pair of accounts
     */
    @Override
    public Schema outputSchema(Schema input) {
        try {
            Schema pairSchema = new Schema();
            // exec() stores Strings in each pair, so the fields are declared as chararray
            pairSchema.add(new FieldSchema(null, DataType.CHARARRAY));
            pairSchema.add(new FieldSchema(null, DataType.CHARARRAY));
            return new Schema(
                    new FieldSchema(null,
                            new Schema(pairSchema), DataType.BAG));
        } catch (Exception e) {
            // No schema information if anything goes wrong
            return null;
        }
    }
}
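If you want to sanity-check the pairing logic without a Pig classpath, here is a minimal sketch in plain Java. The names PairsSketch and findPairs, and the use of a Map from account id to creation timestamp, are made up for illustration and are not part of the UDF. Unlike the UDF above, it always puts the oldest account first, which is what the question asks for, and it treats ids as plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PairsSketch {

    /**
     * Returns (oldest, other) id pairs for one username group.
     * accounts maps account id -> creation timestamp (hypothetical shape).
     */
    public static List<String[]> findPairs(Map<String, Long> accounts) {
        // First pass: find the id with the smallest creation timestamp.
        String pivot = null;
        long pivotCreated = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : accounts.entrySet()) {
            if (pivot == null || e.getValue() < pivotCreated) {
                pivot = e.getKey();
                pivotCreated = e.getValue();
            }
        }
        // Second pass: pair the pivot with every other account, skipping the self-pair.
        List<String[]> pairs = new ArrayList<>();
        for (String id : accounts.keySet()) {
            if (!id.equals(pivot)) {
                pairs.add(new String[] {pivot, id});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The batman group from the question; LinkedHashMap keeps insertion order.
        Map<String, Long> batman = new LinkedHashMap<>();
        batman.put("id1", 100L);
        batman.put("id2", 200L);
        batman.put("id3", 50L);
        for (String[] p : findPairs(batman)) {
            System.out.println("(" + p[0] + "," + p[1] + ")");
        }
    }
}
```

For the batman group this prints (id3,id1) and (id3,id2), matching the desired output in the question. An empty group yields no pairs.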