I wrote a Spark program that receives data from a textSocketStream and computes the average of the temperature values. When I stop sending data to my Spark cluster after ~1 min, the average should not change for the rest of the 1 h window time (roughly the remaining 59 min), but it does change!
Now I found the problem: the amount of data is correct (there are 100 entries in the windowed DStream), but the calculated sum of the values (and therefore also the average, computed as avg = sum / count) toggles between a few different values.
Here is a console output snippet (after I stopped sending data) of both windowedTempJoinPairDStream.print() (sum & count) and windowedTempAvg.print() (avg); both are of the form PairDStream<deviceId, [value]>:
-------------------------------------------
Time: 1472801338000 ms
-------------------------------------------
(1-2-a-b-c,(49.159008,100))
-------------------------------------------
Time: 1472801338000 ms
-------------------------------------------
(1-2-a-b-c,0.49159008)
-------------------------------------------
Time: 1472801339000 ms
-------------------------------------------
(1-2-a-b-c,(49.159016,100))
-------------------------------------------
Time: 1472801339000 ms
-------------------------------------------
(1-2-a-b-c,0.49159014)
-------------------------------------------
Time: 1472801340000 ms
-------------------------------------------
(1-2-a-b-c,(49.159008,100))
-------------------------------------------
Time: 1472801340000 ms
-------------------------------------------
(1-2-a-b-c,0.49159008)
-------------------------------------------
Time: 1472801341000 ms
-------------------------------------------
(1-2-a-b-c,(49.159008,100))
-------------------------------------------
Time: 1472801341000 ms
-------------------------------------------
(1-2-a-b-c,0.49159008)
-------------------------------------------
Time: 1472801342000 ms
-------------------------------------------
(1-2-a-b-c,(49.159008,100))
-------------------------------------------
Time: 1472801342000 ms
-------------------------------------------
(1-2-a-b-c,0.49159008)
-------------------------------------------
Time: 1472801343000 ms
-------------------------------------------
(1-2-a-b-c,(49.159008,100))
-------------------------------------------
Time: 1472801343000 ms
-------------------------------------------
(1-2-a-b-c,0.49159008)
-------------------------------------------
Time: 1472801344000 ms
-------------------------------------------
(1-2-a-b-c,(49.15901,100))
-------------------------------------------
Time: 1472801344000 ms
-------------------------------------------
(1-2-a-b-c,0.4915901)
Here are the different average values from above, in short:
(1-2-a-b-c,0.49159008)
(1-2-a-b-c,0.49159014)
(1-2-a-b-c,0.49159008)
(1-2-a-b-c,0.49159008)
(1-2-a-b-c,0.49159008)
(1-2-a-b-c,0.49159008)
(1-2-a-b-c,0.4915901)
To me this looks like a rounding problem, because my temperature values are of type Float. If that could be the cause, how can I solve it? With temperature values of type Integer everything works fine, no toggling at all...
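The suspicion can be reproduced outside Spark. The following is a minimal, self-contained sketch (not taken from the question's app; the class name and values are made up) showing that float addition is order-dependent, so a non-deterministic summation order, as produced by a shuffle, can yield slightly different sums for identical data:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class FloatOrderDemo {
    public static void main(String[] args) {
        // 100 hypothetical temperature readings of a magnitude similar to the question
        List<Float> temps = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            temps.add(0.4915f + i * 1e-6f);
        }

        // Sum in the original order
        float sumForward = 0f;
        for (float t : temps) {
            sumForward += t;
        }

        // Sum the exact same values in a shuffled order
        List<Float> shuffled = new ArrayList<>(temps);
        Collections.shuffle(shuffled, new Random(42));
        float sumShuffled = 0f;
        for (float t : shuffled) {
            sumShuffled += t;
        }

        // The sums (and therefore the averages) may differ in the last digits,
        // because float addition is not associative.
        System.out.println(sumForward / temps.size());
        System.out.println(sumShuffled / shuffled.size());
    }
}
With Integer values every summation order gives exactly the same result, which matches the observation above.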
If useful, here is the corresponding code snippet of the Spark app that does the average calculation:
// 1. Create the input DStream by listening on the socket
JavaReceiverInputDStream<String> ingoingStream = streamingContext.socketTextStream(serverIp, 11833);
// 2. Map the DStream<String> to a DStream<SensorData> by deserializing JSON objects
JavaDStream<SensorData> sensorDStream = ingoingStream.map(new Function<String, SensorData>() {
public SensorData call(String json) throws Exception {
ObjectMapper om = new ObjectMapper();
return (SensorData)om.readValue(json, SensorData.class);
}
}).cache();
/************************************************ MOVING AVERAGE OF TEMPERATURE *******************************************************************/
// Collect the data to a window of time (this is the time period for average calculation, older data is removed from stream!)
JavaDStream<SensorData> windowMovingAverageSensorDataTemp = sensorDStream.window(windowSizeMovingAverageTemperature);
windowMovingAverageSensorDataTemp.print();
// Map this SensorData stream to a new PairDStream with key = deviceId (so we can group the calculations by id)
// .cache() the stream, because we re-use it more than once!
JavaPairDStream<String, SensorData> windowMovingAverageSensorDataTempPairDStream = windowMovingAverageSensorDataTemp.mapToPair(new PairFunction<SensorData, String, SensorData>() {
public Tuple2<String, SensorData> call(SensorData data) throws Exception {
return new Tuple2<String, SensorData>(data.getIdSensor(), data);
}
}).cache();
// a) Map the PairDStream from above to a new PairDStream of form <deviceID, temperature>
// b) Sum up the values to the total sum, grouped also by key (= device id)
// => both transformations are chained here, but they could also be applied separately (as above)
JavaPairDStream<String, Float> windowMovingAverageSensorDataTempPairDStreamSum = windowMovingAverageSensorDataTempPairDStream.mapToPair(new PairFunction<Tuple2<String,SensorData>, String, Float>() {
public Tuple2<String, Float> call(Tuple2<String, SensorData> sensorDataPair) throws Exception {
String key = sensorDataPair._1();
Float value = sensorDataPair._2().getValTemp();
return new Tuple2<String, Float>(key, value);
}
}).reduceByKey(new Function2<Float, Float, Float>() {
public Float call(Float sumA, Float sumB) throws Exception {
return sumA + sumB;
}
});
// a) Map the PairDStream from above to a new PairDStream of form <deviceID, 1L> to prepare the counting (1 = 1 entry)
// b) Sum up the values to the total count of entries, grouped by key (= device id)
// => also combined both calls
JavaPairDStream<String, Long> windowMovingAverageSensorDataTempPairDStreamCount = windowMovingAverageSensorDataTempPairDStream.mapToPair(new PairFunction<Tuple2<String,SensorData>, String, Long>() {
public Tuple2<String, Long> call(Tuple2<String, SensorData> sensorDataPair) throws Exception {
String key = sensorDataPair._1();
Long value = 1L;
return new Tuple2<String, Long>(key, value);
}
}).reduceByKey(new Function2<Long, Long, Long>() {
public Long call(Long countA, Long countB) throws Exception {
return countA + countB;
}
});
// Make a join of the sum and count Streams, so this puts together data with same keys (device id)
// This results in a new PairDStream of <deviceID, <sumOfTemp, countOfEntries>>
JavaPairDStream<String, Tuple2<Float, Long>> windowedTempJoinPairDStream = windowMovingAverageSensorDataTempPairDStreamSum.join(windowMovingAverageSensorDataTempPairDStreamCount).cache();
// Calculate the average temperature by avg = sumOfTemp / countOfEntries, do this for each key (device id)
JavaPairDStream<String, Float> windowedTempAvg = windowedTempJoinPairDStream.mapToPair(new PairFunction<Tuple2<String,Tuple2<Float,Long>>, String, Float>() {
public Tuple2<String, Float> call(Tuple2<String, Tuple2<Float, Long>> joinedData) throws Exception {
String key = joinedData._1();
float tempSum = joinedData._2()._1();
long count = joinedData._2()._2();
float avg = tempSum / (float)count;
return new Tuple2<String, Float>(key, avg);
}
});
// print the joined PairDStream from above to check sum & count visually
windowedTempJoinPairDStream.print();
// print the final, calculated average values for each device id in form (deviceId, avgTemperature)
windowedTempAvg.print();
// ========================================================= START THE STREAM ============================================================
// Start streaming & listen until stream is closed
streamingContext.start();
streamingContext.awaitTermination();
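As a side note on the structure (this is not from the original post, and the variable names are illustrative): the sum and the count could also be computed in a single reduceByKey pass over (sum, count) pairs, which avoids joining the two streams. A sketch reusing windowMovingAverageSensorDataTempPairDStream from above; note that this alone does not remove the floating-point order sensitivity:
// Map each entry to (deviceId, (temperature, 1L)) and reduce both components in one pass
JavaPairDStream<String, Tuple2<Float, Long>> windowedTempSumAndCount = windowMovingAverageSensorDataTempPairDStream
        .mapToPair(new PairFunction<Tuple2<String, SensorData>, String, Tuple2<Float, Long>>() {
            public Tuple2<String, Tuple2<Float, Long>> call(Tuple2<String, SensorData> pair) throws Exception {
                return new Tuple2<String, Tuple2<Float, Long>>(pair._1(), new Tuple2<Float, Long>(pair._2().getValTemp(), 1L));
            }
        })
        .reduceByKey(new Function2<Tuple2<Float, Long>, Tuple2<Float, Long>, Tuple2<Float, Long>>() {
            public Tuple2<Float, Long> call(Tuple2<Float, Long> a, Tuple2<Float, Long> b) throws Exception {
                return new Tuple2<Float, Long>(a._1() + b._1(), a._2() + b._2());
            }
        });
// avg = sum / count, exactly as in the joined variant above
JavaPairDStream<String, Float> windowedTempAvgSinglePass = windowedTempSumAndCount.mapToPair(
        new PairFunction<Tuple2<String, Tuple2<Float, Long>>, String, Float>() {
            public Tuple2<String, Float> call(Tuple2<String, Tuple2<Float, Long>> entry) throws Exception {
                return new Tuple2<String, Float>(entry._1(), entry._2()._1() / entry._2()._2());
            }
        });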
EDIT: I just changed my code to use StatCounter for the average calculation, but I still get different average values. Here is the new console output:
-------------------------------------------
Time: 1473077627000 ms
-------------------------------------------
(1-2-a-b-c,0.4779797872435302)
-------------------------------------------
Time: 1473077628000 ms
-------------------------------------------
(1-2-a-b-c,0.4779797872435303)
-------------------------------------------
Time: 1473077629000 ms
-------------------------------------------
(1-2-a-b-c,0.4779797872435301)
-------------------------------------------
Time: 1473077630000 ms
-------------------------------------------
(1-2-a-b-c,0.4779797872435302)
-------------------------------------------
Time: 1473077631000 ms
-------------------------------------------
(1-2-a-b-c,0.4779797872435301)
-------------------------------------------
Time: 1473077632000 ms
-------------------------------------------
(1-2-a-b-c,0.47797978724353024)
-------------------------------------------
Time: 1473077633000 ms
-------------------------------------------
(1-2-a-b-c,0.47797978724353013)
Answer (score: 1)
At least at first glance this is not particularly strange. As you already suspected, this is most likely a result of rounding errors. Since FP arithmetic is neither associative nor commutative, and Spark shuffles are non-deterministic, we can expect the results to fluctuate between runs.
How much you can do about it depends on your constraints:
- You can use o.a.s.util.StatCounter, which implements a variant of the online algorithm with better numerical properties (see the sketch after this list).
- You can use arbitrary-precision numbers such as BigDecimal.
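A minimal sketch of how the StatCounter option could be plugged into the pair DStream from the question's code (this is my assumption of how it might look, not the poster's actual new snippet; it uses Java 8 lambdas, the windowMovingAverageSensorDataTempPairDStream defined above, and illustrative variable names):
// requires: import org.apache.spark.util.StatCounter;

// Wrap every temperature in its own StatCounter, then merge the counters per device id.
// copy() keeps the reduce function free of side effects on its inputs.
JavaPairDStream<String, StatCounter> statsPerDevice = windowMovingAverageSensorDataTempPairDStream
        .mapValues(data -> new StatCounter().merge(data.getValTemp()))
        .reduceByKey((a, b) -> a.copy().merge(b));

// StatCounter tracks count, sum and mean; the mean is a double
JavaPairDStream<String, Double> avgPerDevice = statsPerDevice.mapValues(StatCounter::mean);

avgPerDevice.print();
If the result has to be exact rather than merely more stable, the second option (accumulating the sums as BigDecimal and dividing at the end) trades performance for exactness.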