Hadoop secondary sort on values: sorting and grouping lose values

Asked: 2013-01-16 16:11:05

Tags: sorting hadoop mapreduce partitioning

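This is a fairly common problem, but I can't work out which approach to choose.

I have the fields: id, creationDate, state, dateDiff

id is the natural key.

In my reducer I need to receive: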

KEY(id),VALUE(creationDate,state,dateDiff)

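VALUE(creationDate,state,dateDiff) should be sorted in this order: creationDate, state

Which key should I choose? I created a composite key (id, creationDate, state).

I implemented a partitioner on id, a grouping comparator on id, and a sort comparator on id, creationDate, state.

In my reducer I get only one record per id. For example, for this input: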

1 123 true  6
1 456 false 6
1 789 true  7

I only get

1 123 true  6

in my reducer. It looks like I have not really understood the sorter, partitioner, and grouper :(

Here is my code:

public class POIMapper extends Mapper<LongWritable, Text, XVLRKey, XVLRValue>{

    private static final Log LOG = LogFactory.getLog(POIMapper.class);

    @Override
    public void map(LongWritable key, Text csvLine, Context context) throws IOException, InterruptedException {
        Pair<XVLRKey, XVLRValue> xvlrPair = POIUtil.parseKeyAndValue(csvLine.toString(), POIUtil.CSV_DELIMITER);
        context.write(xvlrPair.getValue0(), xvlrPair.getValue1());
    }

}

public class POIReducer extends Reducer<XVLRKey, XVLRValue, LongWritable, Text>{

    private static final Log LOG = LogFactory.getLog(POIReducer.class);

    private final Text textForOutput = new Text();

    @Override
    public void reduce(XVLRKey key, Iterable<XVLRValue> values, Context context)
            throws IOException, InterruptedException {
        // Just check that values are correctly attached to keys. No logic here...
        LOG.info("\nPOIReducer: key:"+key);
        for(XVLRValue value : values){
            LOG.info("\n --- --- --- value:"+value+"\n");
            textForOutput.set(print(key, value));
            context.write(key.getMsisdn(), textForOutput);
        }
    }

    private String print(XVLRKey key, XVLRValue value){
        StringBuilder builder = new StringBuilder();
        builder.append(value.getLac())          .append("\t")
               .append(value.getCellId())       .append("\t")
               .append(key.getDateOccurrence()) .append("\t")
               .append(value.getTimeDelta());
        return builder.toString();
    }
}

Job code:

JobBuilder<POIJob> jobBuilder = createTestableJobInstance();

jobBuilder.withOutputKey(XVLRKey.class);
jobBuilder.withOutputValue(XVLRValue.class);

jobBuilder.withMapper(POIMapper.class);
jobBuilder.withReducer(POIReducer.class);

jobBuilder.withInputFormat(TextInputFormat.class);
jobBuilder.withOutputFormat(TextOutputFormat.class);

jobBuilder.withPartitioner(XVLRKeyPartitioner.class);
jobBuilder.withSortComparator(XVLRCompositeKeyComparator.class);
jobBuilder.withGroupingComparator(XVLRKeyGroupingComparator.class);

boolean result = buildSubmitAndWaitForCompletion(jobBuilder);
MatcherAssert.assertThat(result, Matchers.is(true));

public class XVLRKeyPartitioner extends Partitioner<XVLRKey, XVLRValue> {

    @Override
    public int getPartition(XVLRKey key, XVLRValue value, int numPartitions) {
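        // Mask the sign bit instead of using Math.abs: Math.abs(Integer.MIN_VALUE) is still negative.
        return ((key.getMsisdn().hashCode() * 127) & Integer.MAX_VALUE) % numPartitions;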
    }
}

public class XVLRCompositeKeyComparator extends WritableComparator {

    protected XVLRCompositeKeyComparator() {
        super(XVLRKey.class, true);
    }

    @Override
    public int compare(WritableComparable writable1, WritableComparable writable2) {
        XVLRKey key1 = (XVLRKey) writable1;
        XVLRKey key2 = (XVLRKey) writable2;
        return key1.compareTo(key2);
    }
}

public class XVLRKeyGroupingComparator extends WritableComparator {

    protected XVLRKeyGroupingComparator() {
        super(XVLRKey.class, true);
    }

    @Override
    public int compare(WritableComparable writable1, WritableComparable writable2) {

        XVLRKey key1 = (XVLRKey) writable1;
        XVLRKey key2 = (XVLRKey) writable2;

        return key1.getMsisdn().compareTo(key2.getMsisdn());

    }
}

public class XVLRKey implements WritableComparable<XVLRKey>{

    private final LongWritable msisdn;
    private final LongWritable dateOccurrence;
    private final BooleanWritable state;
    // getters-setters
}

public class XVLRValue implements WritableComparable<XVLRValue> {

    private final LongWritable lac;
    private final LongWritable cellId;
    private final LongWritable timeDelta;
    private final LongWritable dateOccurrence;
    private final BooleanWritable state;
    // getters-setters
}

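Note that XVLRKey and XVLRValue do have duplicated fields. I duplicated dateOccurrence in XVLRKey because I want to receive the values sorted in the reducer. They should be sorted by dateOccurrence.

I can't find a way to solve this without the duplication.

1 answer:

Answer 0 (score: 0)

In a secondary-sort setup (as you describe), the contents of the key you hold change when you retrieve the next value from the iterator.

This is because the Hadoop framework reuses object instances to avoid object creation and garbage collection as much as possible.

So when you call "next()", the framework also changes the data in the key instance.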

So if you move the

    LOG.info("\nPOIReducer: key:"+key);

statement so that it is inside the for loop, you should see all of the keys come by.
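The effect can be reproduced without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: `Key`, `Rec`, and `values(...)` are hypothetical stand-ins that only imitate the behaviour described above, where one shared key instance is overwritten each time the value iterator advances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ReusedKeyDemo {

    /** Hypothetical mutable composite key, standing in for XVLRKey. */
    static final class Key {
        long id;
        long date;

        void set(long id, long date) {
            this.id = id;
            this.date = date;
        }

        @Override
        public String toString() {
            return id + "/" + date;
        }
    }

    /** One already-sorted (key, value) record belonging to a single reduce group. */
    static final class Rec {
        final long id;
        final long date;
        final String value;

        Rec(long id, long date, String value) {
            this.id = id;
            this.date = date;
            this.value = value;
        }
    }

    /** Iterates over the group's values and overwrites the shared key on every step. */
    static Iterator<String> values(List<Rec> group, Key shared) {
        final Iterator<Rec> it = group.iterator();
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public String next() {
                Rec r = it.next();
                shared.set(r.id, r.date); // same Key instance, new contents
                return r.value;
            }
        };
    }

    /** Logs the key once before the loop and once per value inside the loop. */
    static List<String> run() {
        List<Rec> group = Arrays.asList(
                new Rec(1, 123, "true 6"),
                new Rec(1, 456, "false 6"),
                new Rec(1, 789, "true 7"));

        Key key = new Key();
        key.set(group.get(0).id, group.get(0).date); // what reduce() sees on entry

        List<String> log = new ArrayList<>();
        log.add("before loop: " + key);

        Iterator<String> it = values(group, key);
        while (it.hasNext()) {
            String value = it.next();
            log.add(key + " -> " + value);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : run()) {
            System.out.println(line);
        }
    }
}
```

Logging before the loop shows only the first key of the group, while logging inside the loop shows a different key for each value.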

Because of this effect, I essentially use the following "rule" to get the job done:

  

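The key is only used by the framework to direct the values to the right reducer.

This means

  1. Everything I might need must be present in the value.
  2. In the reducer I only look at the values; I always discard/ignore the key.
  3. The attributes used to create the key can therefore also be found in the value.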
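The three points above can be sketched in plain Java. This is a simulation of the shuffle under stated assumptions, not Hadoop API code: `Value`, `Key`, `SORT`, `sample()`, and `run(...)` are hypothetical names, and the field names are taken from the question. The key is derived entirely from the value, it is used only for ordering, and the "reducer" part reads nothing but the values.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ValueOnlyReduceSketch {

    /** Hypothetical value: carries every field, including those the key is built from. */
    static final class Value {
        final long id;
        final long creationDate;
        final boolean state;
        final long dateDiff;

        Value(long id, long creationDate, boolean state, long dateDiff) {
            this.id = id;
            this.creationDate = creationDate;
            this.state = state;
            this.dateDiff = dateDiff;
        }
    }

    /** Hypothetical composite key, derived entirely from the value (point 3). */
    static final class Key {
        final long id;
        final long creationDate;
        final boolean state;

        private Key(long id, long creationDate, boolean state) {
            this.id = id;
            this.creationDate = creationDate;
            this.state = state;
        }

        static Key of(Value v) {
            return new Key(v.id, v.creationDate, v.state);
        }
    }

    /** Sort order of the composite key: id, then creationDate, then state (false before true). */
    static final Comparator<Key> SORT = Comparator
            .comparingLong((Key k) -> k.id)
            .thenComparingLong(k -> k.creationDate)
            .thenComparing((a, b) -> Boolean.compare(a.state, b.state));

    /** Unsorted sample input, using the rows from the question plus one extra id. */
    static List<Value> sample() {
        return Arrays.asList(
                new Value(1, 789, true, 7),
                new Value(2, 100, false, 1),
                new Value(1, 123, true, 6),
                new Value(1, 456, false, 6));
    }

    /** Simulated shuffle and reduce: sort by composite key, group by id, read only the values. */
    static Map<Long, List<String>> run(List<Value> input) {
        List<Map.Entry<Key, Value>> pairs = new ArrayList<>();
        for (Value v : input) {
            pairs.add(new AbstractMap.SimpleEntry<>(Key.of(v), v));
        }
        pairs.sort((a, b) -> SORT.compare(a.getKey(), b.getKey()));

        Map<Long, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<Key, Value> pair : pairs) {
            Value v = pair.getValue(); // the key is ignored from here on (point 2)
            out.computeIfAbsent(v.id, id -> new ArrayList<>())
               .add(v.creationDate + " " + v.state + " " + v.dateDiff);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(sample()));
    }
}
```

With the sample input, the values for id 1 come out ordered by creationDate (123, 456, 789), even though the grouping step never looks at the key.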