Question

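I am using the following code to map some data in Spark. While mapping it to a pair RDD, I need to generate a unique sequence number for each task. I tried using an accumulator, but I learned from the exception that a task cannot read an accumulator's value. Please help, as I am just starting out and have no idea how to solve this.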

Accumulator<Integer> uniqueIdAccumulator = context.getJavaSparkContext().accumulator(0, "uniqueId");
JavaPairRDD<String, String> rdd1 = javaPairRdd.mapToPair(f-> {
    uniqueIdAccumulator.add(1);
    return new Tuple2<String,String>(f._1, this.getMessageString(f._2, null,uniqueIdAccumulator.value()));
});

Answer 1

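// zipWithIndex() pairs each (key, value) tuple with a Long index:
// f._1 is the original tuple and f._2 is its index.
JavaPairRDD<String, String> rdd1 = javaPairRdd.zipWithIndex().mapToPair(f ->
    new Tuple2<>(f._1._1, this.getMessageString(f._1._2, null, f._2)));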

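No accumulator is needed here; zipWithIndex provided the solution. zipWithIndex returns an RDD that pairs each existing tuple with a Long index, and I used that index as the unique sequence number.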
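
To illustrate why this works without a shared counter, here is a minimal plain-Java sketch of the numbering scheme. It is a model, not Spark code: the class `ZipWithIndexSketch` and its `zipWithIndex` helper are hypothetical names, and each inner list stands in for one partition. It assumes the ordering Spark documents for zipWithIndex, namely by partition index first and then by position within each partition.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ZipWithIndexSketch {

    // Hypothetical helper: models the index assignment on plain lists, where
    // each inner list stands in for one partition. This is not Spark code.
    static List<List<Map.Entry<String, Long>>> zipWithIndex(List<List<String>> partitions) {
        // Pass 1: the start offset of a partition is the total size of
        // all partitions before it.
        long[] startOffsets = new long[partitions.size()];
        long runningTotal = 0L;
        for (int p = 0; p < partitions.size(); p++) {
            startOffsets[p] = runningTotal;
            runningTotal += partitions.get(p).size();
        }

        // Pass 2: each partition numbers its own elements independently,
        // starting from its offset, so no shared counter is needed.
        List<List<Map.Entry<String, Long>>> result = new ArrayList<>();
        for (int p = 0; p < partitions.size(); p++) {
            List<String> partition = partitions.get(p);
            List<Map.Entry<String, Long>> indexed = new ArrayList<>();
            for (int i = 0; i < partition.size(); i++) {
                Long index = Long.valueOf(startOffsets[p] + i);
                indexed.add(new SimpleEntry<String, Long>(partition.get(i), index));
            }
            result.add(indexed);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f"));
        // Prints [[a=0, b=1], [c=2], [d=3, e=4, f=5]]
        System.out.println(zipWithIndex(partitions));
    }
}
```

The first pass is the reason Spark's zipWithIndex has to run a job when the RDD has more than one partition: the partition sizes must be known before any index can be assigned.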

How to generate a number sequence for each task in Spark