在我的代码中,我有几个计算,我分别在三个不同的列上执行以计算方差/ std / mean等.....问题是由于必须重新映射值然后运行相当长计算每列的方差。
是否可以同时运行所有这三个语句异步并检索示例中下面指定的3个变量中的最终值?
final Double varSHOUR = dataset.mapToDouble(new DoubleFunction<modelEhealth>() {
@Override
public double call(modelEhealth modelEhealth) throws Exception {
return modelEhealth.getSHOUR();
}
}).variance();
final Double varHOURLYFRAMESIN = dataset.mapToDouble(new DoubleFunction<modelEhealth>() {
@Override
public double call(modelEhealth modelEhealth) throws Exception {
return modelEhealth.getHOURLYFRAMESIN();
}
}).variance();
final Double varHOURLYFRAMESOUT = dataset.mapToDouble(new DoubleFunction<modelEhealth>() {
@Override
public double call(modelEhealth modelEhealth) throws Exception {
return modelEhealth.getHOURLYFRAMESOUT();
}
}).variance();
答案 0 :(得分:1)
您必须使用JavaDoubleRDD.variance()
类而不是Double来模仿Spark对ModelHealth
的实现。这并不难,因为您可以使用Spark的StatCounter
进行实际计算,您只需要其中的3个。
例如,我将使用一个简单的ModelHealth
与3 Double
个字段v1,v2,v3:
static class ModelHealth {
final Double v1;
final Double v2;
final Double v3;
}
然后:
JavaRDD<ModelHealth> dataset = // your data
// zero value - three empty StatCounters:
final Tuple3<StatCounter, StatCounter, StatCounter> zeroValue = new Tuple3<>(new StatCounter(), new StatCounter(), new StatCounter());
// using `aggregate` to aggregate ModelHealth records into three StatCounters:
final Tuple3<StatCounter, StatCounter, StatCounter> stats = dataset.aggregate(zeroValue, new Function2<Tuple3<StatCounter, StatCounter, StatCounter>, ModelHealth, Tuple3<StatCounter, StatCounter, StatCounter>>() {
@Override
public Tuple3<StatCounter, StatCounter, StatCounter> call(Tuple3<StatCounter, StatCounter, StatCounter> stats, ModelHealth record) throws Exception {
// merging record into tuple of StatCounters - each value merged with corresponding counter
stats._1().merge(record.v1);
stats._2().merge(record.v2);
stats._3().merge(record.v3);
return stats;
}
}, new Function2<Tuple3<StatCounter, StatCounter, StatCounter>, Tuple3<StatCounter, StatCounter, StatCounter>, Tuple3<StatCounter, StatCounter, StatCounter>>() {
@Override
public Tuple3<StatCounter, StatCounter, StatCounter> call(Tuple3<StatCounter, StatCounter, StatCounter> v1, Tuple3<StatCounter, StatCounter, StatCounter> v2) throws Exception {
// merging tuples of StatCounters - each counter merged with corresponding one
v1._1().merge(v2._1());
v1._2().merge(v2._2());
v1._3().merge(v2._3());
return v1;
}
});
Double v1_variance = stats._1().variance();
Double v2_variance = stats._2().variance();
Double v3_variance = stats._3().variance();
这给出了相同的结果,但在数据集上只有一个聚合。