Spark Java API: mapToPair & reduceByKey vs. combineByKey

Asked: 2016-08-11 07:10:08

Tags: apache-spark aggregate rdd

I have a Spark Java application that builds a PairRDD like this:

<key>, <value>:
01 , 1.2
01 , 2.3
01 , 3.0
...
02 , 0.0
02 , 1.1
02 , 12.2
...

where the keys represent the day of month and the values are sensor readings. However, I want to calculate the average value per day. So I came up with the following two steps:

  1. Count the number of sensor values per day (= per key)
  2. Sum up the sensor values per day (= per key)

The sum (2) is straightforward: just call reduceByKey to add up the sensor values per day. (How I then combine these sums with the counts into the averages is sketched after option B below.)

But for counting the values per day (1), I have the following two options:

A: Using mapToPair & reduceByKey transformations like this:

    // Map each (day, value) record to (day, 1), then sum the 1s per day.
    JavaPairRDD<String, Long> sensorValuesCount = sensorValuesDay.mapToPair(new PairFunction<Tuple2<String,Double>, String, Long>() {
        @Override
        public Tuple2<String, Long> call(Tuple2<String, Double> sensorValueDay) throws Exception {
            return new Tuple2<String, Long> (sensorValueDay._1(), 1L);
        }
    }).reduceByKey(new Function2<Long, Long, Long>() {
        @Override
        public Long call(Long countA, Long countB) throws Exception {
            return countA + countB;
        }
    });
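
For reference, if Java 8 lambdas are available (my assumption; the snippet above targets the pre-Java-8 API), option A collapses to:

    JavaPairRDD<String, Long> sensorValuesCount = sensorValuesDay
            .mapToPair(t -> new Tuple2<String, Long>(t._1(), 1L)) // (day, 1) per record
            .reduceByKey((a, b) -> a + b);                        // sum the 1s per day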
    

B: Using the combineByKey transformation:

    // createCombiner starts the count at 1 for the first value of a key;
    // mergeValue increments it per further value; mergeCombiners adds partial counts.
    JavaPairRDD<String, Long> sensorValuesCount = sensorValuesDay.combineByKey(new Function<Double, Long>() {
        @Override
        public Long call(Double arg0) throws Exception {
            return 1L;
        }
    }, new Function2<Long, Double, Long>() {
        @Override
        public Long call(Long countYet, Double arg1) throws Exception {
            return countYet + 1;
        }
    }, new Function2<Long, Long, Long>() {
        @Override
        public Long call(Long countA, Long countB) throws Exception {
            return countA + countB;
        }
    });
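
Either way, I would then combine the counts with the per-day sums from step (2) to get the averages. A minimal sketch of what I have in mind (sensorValuesSum and dailyAverage are names I made up; Java 8 lambdas assumed again):

    // Step (2): sum the sensor values per day.
    JavaPairRDD<String, Double> sensorValuesSum = sensorValuesDay
            .reduceByKey((a, b) -> a + b);

    // Join the sums with the counts, then divide to get the average per day.
    JavaPairRDD<String, Double> dailyAverage = sensorValuesSum
            .join(sensorValuesCount)
            .mapValues(t -> t._1() / t._2());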
    

Are there any differences in runtime or memory consumption between the two options? Which one should I pick for this rather "lightweight" task? And should I go for some other solution when computing "more complex" RDDs? Thanks!
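
PS: For the "more complex" case I mentioned, one further option I am considering is computing sum and count together in a single combineByKey pass and deriving the average from that. A rough, untested sketch (sumAndCount is my own name):

    // The accumulator is a (sum, count) tuple, built in one pass over the data.
    JavaPairRDD<String, Tuple2<Double, Long>> sumAndCount = sensorValuesDay.combineByKey(
            v -> new Tuple2<Double, Long>(v, 1L),                                  // createCombiner
            (acc, v) -> new Tuple2<Double, Long>(acc._1() + v, acc._2() + 1L),     // mergeValue
            (a, b) -> new Tuple2<Double, Long>(a._1() + b._1(), a._2() + b._2())); // mergeCombiners

    JavaPairRDD<String, Double> dailyAverage = sumAndCount.mapValues(t -> t._1() / t._2());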

0 Answers:

There are no answers yet.