Question

I want to assign a sequence number to the events coming from my Mapper class, based on the time at which each event occurred.

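For example, I have 100 events, each with a timestamp. I want to sort them by time and then assign sequence numbers to them in the reducer phase. Duplicate records (the same event occurring at the same time) should also be removed in the reduce phase.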

Mapper method:

public class EventMapper extends Mapper<LongWritable, Text, Text, Event> {

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    Text newKey;
    Event e = new Event();
    e.setAllValues(line);
    newKey = new Text(e.getKey());
    context.write(newKey, e);
}
}

Reducer method (what I want):

public class EventReducer extends Reducer<Text, Event, Text, Text> {

public void reduce(Text key, Iterator<Event> itrtr, Context context) throws IOException, InterruptedException {
    Event e;
    List<Event> l = new ArrayList<Event>();
    while(itrtr.hasNext()){
        e = itrtr.next();
        l.add(e);
    }
    Collections.sort(l);
    long i = 1;
    for (Event event : l) {
        event.setId(++i);
        context.write(key, new Text(event.toString()));
    }
}
}

I am getting 0 for every ID. How can I achieve this? Am I taking the wrong approach?

Here is the Event class:

public class Event implements Writable, WritableComparable<Event> {
//Some variables and getter + setters
@Override
public String toString() {
    String delimiter1 = "|";
    return this.date + delimiter1
            + this.evName + delimiter1
            + this.evType + delimiter1
            + this.evValue + delimiter1
            + this.name + delimiter1
            + this.id;
}

@Override
public void readFields(DataInput in) throws IOException {
    try {
        this.date = converStringToDate((WritableUtils.readString(in)).toString(), dateFormat);
    } catch (ParseException ex) {
        System.out.println("Wront date . Pe");
    }
    this.evName = WritableUtils.readString(in);
    this.evType = WritableUtils.readString(in);
    this.evValue = WritableUtils.readString(in);
    this.name = WritableUtils.readString(in);
    this.id = WritableUtils.readVLong(in);
}

@Override
public void write(DataOutput out) throws IOException {
    // TODO Auto-generated method stub
    WritableUtils.writeString(out, this.convertDateToString(date));
    WritableUtils.writeString(out, evName);
    WritableUtils.writeString(out, evType);
    WritableUtils.writeString(out, evValue);
    WritableUtils.writeString(out, name);
    WritableUtils.writeVLong(out, id);
}

public int compareTo(Event o) {
    long value = this.getDate().getTime() - o.getDate().getTime();
    if (value == 0) {
        return 0;
    } else if (value > 1) {
        return -1;
    } else {
        return 1;
    }
}
public void setAllValues(String input) {
    String[] arrValues = input.split(delimiter);
    System.out.println("No of Values = " + arrValues.length);
    try {
        this.date = converStringToDate(arrValues[0], dateFormat);
    } catch (ParseException pe) {
        System.out.println("pe> Error in date");
    }
    if (arrValues.length >= 2) {
        this.evName = arrValues[1];
    }
    if (arrValues.length >= 3) {
        this.evType = arrValues[2];
    }
    if (arrValues.length >= 4) {
        this.evValue = arrValues[3];
    }
    if (arrValues.length >= 5) {
        this.name = arrValues[4];
    }
}

public String getKey() {
    //return convertDateToString(this.date) + this.evName + this.evType;
    return this.evName;
}
}

Answer 1

Some advice:

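Change getKey() to return date.getTime(). This is a long value, which is faster to compare than a string. Change your intermediate key type to LongWritable.
You are relying on the Hadoop behavior of sorting records by key before they are passed to the reducer. That is one way to sort, but you must make sure the number of reducers is set to 1 in your job configuration (for example with job.setNumReduceTasks(1)). Otherwise you will have multiple reducers, each assigning ranks starting from 1 on its own partition.
You could use more than one reducer, but you would then have to follow this job with another one that merges the data partitions and all of their internal ranks.
Remember that the reducer is called once per key value, even if several records share that key (for example, multiple events at the same time). If you want to ignore these duplicate events, the reducer should write just one record to the context, regardless of how many records the Iterable of values contains.
To assign the rank (id) correctly, you need an instance variable of type long in your reducer (call it counter). Initialize it in the setup() method, then increment it in the reduce() method.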
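
The counter and de-duplication advice above can be sketched as follows. This is a plain-Java stand-in with no Hadoop dependency, not the asker's actual reducer: the class name SequencingReducerSketch, the run() helper, and the use of String events are assumptions made for illustration. A TreeMap plays the role of the shuffle/sort phase feeding a single reducer, and a List plays the role of the Context.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical plain-Java stand-in for a Hadoop Reducer keyed by event timestamp. */
final class SequencingReducerSketch {
    // Instance variable: it keeps its value across reduce() calls.
    private long counter;
    // Stand-in for Context: collects the records the reducer would write.
    private final List<String> output = new ArrayList<>();

    /** Mirrors Reducer.setup(): runs once, before the first reduce() call. */
    void setup() {
        counter = 0;
    }

    /** Mirrors Reducer.reduce(): called once per distinct key, in sorted key order. */
    void reduce(long timestamp, Iterable<String> events) {
        Iterator<String> it = events.iterator();
        if (it.hasNext()) {
            // Assumption: records sharing a timestamp are duplicates, so only
            // the first one is written and the rest are ignored.
            counter++;
            output.add(it.next() + "|" + counter);
        }
    }

    List<String> output() {
        return output;
    }

    /** Simulates sort-by-key followed by a single reducer instance. */
    static List<String> run(long[] timestamps, String[] events) {
        TreeMap<Long, List<String>> grouped = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            grouped.computeIfAbsent(timestamps[i], k -> new ArrayList<>()).add(events[i]);
        }
        SequencingReducerSketch reducer = new SequencingReducerSketch();
        reducer.setup();
        for (Map.Entry<Long, List<String>> entry : grouped.entrySet()) {
            reducer.reduce(entry.getKey(), entry.getValue());
        }
        return reducer.output();
    }

    public static void main(String[] args) {
        long[] timestamps = {300L, 100L, 200L, 100L};
        String[] events = {"c", "a", "b", "a-duplicate"};
        run(timestamps, events).forEach(System.out::println); // prints a|1, b|2, c|3
    }
}
```

In a real job the same shape would presumably live in a class extending Reducer, overriding setup(Context) and reduce(LongWritable, Iterable<Event>, Context). Note that the reduce() in the question takes an Iterator<Event> rather than an Iterable<Event>, so it does not override Reducer.reduce(); the default implementation then passes each record through unchanged, which would be consistent with every id staying 0.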
