I have a Spark Java application which builds a PairRDD of <key>, <value> pairs like this:
01 , 1.2
01 , 2.3
01 , 3.0
...
02 , 0.0
02 , 1.1
02 , 12.2
...
where the keys represent the day of month and the values are sums of sensor values. However, I want to compute the average per day, so I thought of the following two steps: count the values per day (1) and sum them up per day (2), then divide the sum by the count.
The sum (2) is clear: simply call reduceByKey to sum up the sensor values per day.
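A minimal sketch of that step (the variable name sensorValuesSum is just for illustration), assuming the input pair RDD is the sensorValuesDay from the snippets below:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function2;

// Step (2): sum up the sensor values per day.
JavaPairRDD<String, Double> sensorValuesSum = sensorValuesDay.reduceByKey(
        new Function2<Double, Double, Double>() {
            @Override
            public Double call(Double a, Double b) throws Exception {
                return a + b; // merge two partial sums for the same day
            }
        });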
But for counting the values per day (1), I have the following two options:
A: using mapToPair & reduceByKey like this:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

JavaPairRDD<String, Long> sensorValuesCount = sensorValuesDay
        .mapToPair(new PairFunction<Tuple2<String, Double>, String, Long>() {
            @Override
            public Tuple2<String, Long> call(Tuple2<String, Double> sensorValueDay) throws Exception {
                // emit (day, 1) for every sensor value
                return new Tuple2<String, Long>(sensorValueDay._1(), 1L);
            }
        })
        .reduceByKey(new Function2<Long, Long, Long>() {
            @Override
            public Long call(Long countA, Long countB) throws Exception {
                // add up the ones per day
                return countA + countB;
            }
        });
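For what it's worth, with Java 8 lambdas option A is equivalent to this shorter sketch (same semantics):

JavaPairRDD<String, Long> sensorValuesCount = sensorValuesDay
        .mapToPair(t -> new Tuple2<>(t._1(), 1L)) // (day, 1) per sensor value
        .reduceByKey((a, b) -> a + b);            // add up the ones per day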
B: using combineByKey like this:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

JavaPairRDD<String, Long> sensorValuesCount = sensorValuesDay.combineByKey(
        new Function<Double, Long>() {
            @Override
            public Long call(Double arg0) throws Exception {
                // createCombiner: the first value seen for a key counts as 1
                return 1L;
            }
        },
        new Function2<Long, Double, Long>() {
            @Override
            public Long call(Long countYet, Double arg1) throws Exception {
                // mergeValue: one more value for this key
                return countYet + 1;
            }
        },
        new Function2<Long, Long, Long>() {
            @Override
            public Long call(Long countA, Long countB) throws Exception {
                // mergeCombiners: merge partial counts from different partitions
                return countA + countB;
            }
        });
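And the lambda form of option B, again only as a sketch for comparison; note that, unlike A, it never materializes intermediate (day, 1) pairs:

JavaPairRDD<String, Long> sensorValuesCount = sensorValuesDay.combineByKey(
        v -> 1L,                 // createCombiner: first value of a key counts as 1
        (count, v) -> count + 1, // mergeValue: one more value for this key
        (a, b) -> a + b);        // mergeCombiners: merge partial counts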
Are there differences in runtime or memory consumption? Which one should I choose for such "lightweight" tasks? And should I take another solution when computing "more complex" RDDs? Thanks!