variance / std / mean

Date: 2016-03-20 05:42:15

Tags: scala apache-spark dataframe apache-spark-sql

In my code I have several calculations that I run separately over three different columns to compute the variance / std / mean and so on. The problem is that each one has to re-map the values and then run a fairly long computation to get the variance, once per column.

Is it possible to run all three of these statements concurrently (asynchronously) and retrieve the final values into the three variables shown below?

final Double varSHOUR               = dataset.mapToDouble(new DoubleFunction<modelEhealth>() {
    @Override
    public double call(modelEhealth modelEhealth) throws Exception {
        return modelEhealth.getSHOUR();
    }
}).variance();
final Double varHOURLYFRAMESIN      = dataset.mapToDouble(new DoubleFunction<modelEhealth>() {
    @Override
    public double call(modelEhealth modelEhealth) throws Exception {
        return modelEhealth.getHOURLYFRAMESIN();
    }
}).variance();
final Double varHOURLYFRAMESOUT     = dataset.mapToDouble(new DoubleFunction<modelEhealth>() {
    @Override
    public double call(modelEhealth modelEhealth) throws Exception {
        return modelEhealth.getHOURLYFRAMESOUT();
    }
}).variance();
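As an aside on the "run them asynchronously" part of the question: Spark's scheduler is thread-safe, so independent actions can simply be submitted from separate threads. Below is a minimal plain-Java sketch of that dispatch pattern using an `ExecutorService`; the `variance` helper here is a stand-in for the real Spark `variance()` call, and the class and method names are illustrative, not Spark API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConcurrentVariance {
    // Population variance of a double array (what JavaDoubleRDD.variance() returns).
    static double variance(double[] xs) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double m2 = 0;
        for (double x : xs) m2 += (x - mean) * (x - mean);
        return m2 / xs.length;
    }

    // Submits the three computations to a thread pool and blocks until all finish.
    static double[] computeAll(double[] a, double[] b, double[] c) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            Future<Double> fa = pool.submit(() -> variance(a));
            Future<Double> fb = pool.submit(() -> variance(b));
            Future<Double> fc = pool.submit(() -> variance(c));
            return new double[] { fa.get(), fb.get(), fc.get() };
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that this still runs three separate jobs over the data; the answer below avoids that entirely with a single aggregation.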

1 Answer:

Answer 0 (score: 1)

You'd have to mimic what Spark does in JavaDoubleRDD.variance(), but for your ModelHealth class instead of Double. That isn't hard, because you can use Spark's StatCounter to do the actual computation; you just need three of them.
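What makes this work is that StatCounter keeps a running count, mean, and sum of squared deviations, and supports merging both single values and other counters, which is exactly what a one-pass aggregation needs. A rough self-contained sketch of the idea (a hypothetical `MiniStatCounter`, not Spark's actual class, though Spark's StatCounter is built on the same merge formulas):

```java
class MiniStatCounter {
    long n = 0;       // number of values seen
    double mean = 0;  // running mean
    double m2 = 0;    // running sum of squared deviations from the mean

    // Fold a single value in (Welford's online update).
    void merge(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    // Combine with another counter (parallel merge of two partial summaries).
    void merge(MiniStatCounter other) {
        if (other.n == 0) return;
        if (n == 0) { n = other.n; mean = other.mean; m2 = other.m2; return; }
        double delta = other.mean - mean;
        long total = n + other.n;
        mean = (n * mean + other.n * other.mean) / total;
        m2 += other.m2 + delta * delta * n * other.n / total;
        n = total;
    }

    // Population variance, matching JavaDoubleRDD.variance().
    double variance() {
        return n == 0 ? Double.NaN : m2 / n;
    }
}
```

The two `merge` overloads map directly onto the two functions `aggregate` takes in the answer's code: the first folds a record into a partition's counters, the second combines counters across partitions.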

For this example, I'll use a simple ModelHealth with three Double fields, v1, v2 and v3:

static class ModelHealth {
    final Double v1;
    final Double v2;
    final Double v3;

    ModelHealth(Double v1, Double v2, Double v3) {
        this.v1 = v1;
        this.v2 = v2;
        this.v3 = v3;
    }
}

Then:

JavaRDD<ModelHealth> dataset = // your data

// zero value - three empty StatCounters:
final Tuple3<StatCounter, StatCounter, StatCounter> zeroValue = new Tuple3<>(new StatCounter(), new StatCounter(), new StatCounter());

// using `aggregate` to aggregate ModelHealth records into three StatCounters:
final Tuple3<StatCounter, StatCounter, StatCounter> stats = dataset.aggregate(zeroValue, new Function2<Tuple3<StatCounter, StatCounter, StatCounter>, ModelHealth, Tuple3<StatCounter, StatCounter, StatCounter>>() {
    @Override
    public Tuple3<StatCounter, StatCounter, StatCounter> call(Tuple3<StatCounter, StatCounter, StatCounter> stats, ModelHealth record) throws Exception {
        // merging record into tuple of StatCounters - each value merged with its corresponding counter
        stats._1().merge(record.v1);
        stats._2().merge(record.v2);
        stats._3().merge(record.v3);
        return stats;
    }
}, new Function2<Tuple3<StatCounter, StatCounter, StatCounter>, Tuple3<StatCounter, StatCounter, StatCounter>, Tuple3<StatCounter, StatCounter, StatCounter>>() {
    @Override
    public Tuple3<StatCounter, StatCounter, StatCounter> call(Tuple3<StatCounter, StatCounter, StatCounter> v1, Tuple3<StatCounter, StatCounter, StatCounter> v2) throws Exception {
        // merging tuples of StatCounters - each counter merged with its corresponding one
        v1._1().merge(v2._1());
        v1._2().merge(v2._2());
        v1._3().merge(v2._3());
        return v1;
    }
});

Double v1_variance = stats._1().variance();
Double v2_variance = stats._2().variance();
Double v3_variance = stats._3().variance();

This gives you the same result, but with only a single aggregation over the dataset.