Spark Streaming: different average values returned by PairDStream.print()

Date: 2016-09-02 08:28:02

Tags: java apache-spark floating-point spark-streaming precision

I wrote a Spark program that receives data from a textSocketStream and calculates the average of the temperature values. When I stop sending data to my Spark cluster after about 1 minute, the average should stay constant for the rest of the 1-hour window, since nothing new arrives during the remaining ~59 minutes. But it keeps changing!

Now I found the problem: the amount of data is correct (there are 100 entries in the windowed DStream), but the computed sum of the values, and with it the average calculated as avg = sum / count, fluctuates between a few different values.

Here are console output snippets, taken after I stopped sending data, of both windowedTempJoinPairDStream.print() (sum & count) and windowedTempAvg.print() (average values), each of the form PairDStream<deviceId, [value]>:

-------------------------------------------
Time: 1472801338000 ms
-------------------------------------------
(1-2-a-b-c,(49.159008,100))

-------------------------------------------
Time: 1472801338000 ms
-------------------------------------------
(1-2-a-b-c,0.49159008)

-------------------------------------------
Time: 1472801339000 ms
-------------------------------------------
(1-2-a-b-c,(49.159016,100))

-------------------------------------------
Time: 1472801339000 ms
-------------------------------------------
(1-2-a-b-c,0.49159014)

-------------------------------------------
Time: 1472801340000 ms
-------------------------------------------
(1-2-a-b-c,(49.159008,100))

-------------------------------------------
Time: 1472801340000 ms
-------------------------------------------
(1-2-a-b-c,0.49159008)

-------------------------------------------
Time: 1472801341000 ms
-------------------------------------------
(1-2-a-b-c,(49.159008,100))

-------------------------------------------
Time: 1472801341000 ms
-------------------------------------------
(1-2-a-b-c,0.49159008)

-------------------------------------------
Time: 1472801342000 ms
-------------------------------------------
(1-2-a-b-c,(49.159008,100))

-------------------------------------------
Time: 1472801342000 ms
-------------------------------------------
(1-2-a-b-c,0.49159008)

-------------------------------------------
Time: 1472801343000 ms
-------------------------------------------
(1-2-a-b-c,(49.159008,100))

-------------------------------------------
Time: 1472801343000 ms
-------------------------------------------
(1-2-a-b-c,0.49159008)

-------------------------------------------
Time: 1472801344000 ms
-------------------------------------------
(1-2-a-b-c,(49.15901,100))

-------------------------------------------
Time: 1472801344000 ms
-------------------------------------------
(1-2-a-b-c,0.4915901)

Here are the different average values from above, in short:

(1-2-a-b-c,0.49159008)
(1-2-a-b-c,0.49159014)
(1-2-a-b-c,0.49159008)
(1-2-a-b-c,0.49159008)
(1-2-a-b-c,0.49159008)
(1-2-a-b-c,0.49159008)
(1-2-a-b-c,0.4915901)

To me this looks like a rounding problem, since my temperature values are of type Float. If that can be the cause, how do I solve the problem?

With temperature values of type Integer everything works fine, there are no fluctuations...

If helpful, here is the corresponding code snippet of the program:

JavaReceiverInputDStream<String> ingoingStream = streamingContext.socketTextStream(serverIp, 11833);

// 2. Map the DStream<String> to a DStream<SensorData> by deserializing JSON objects
JavaDStream<SensorData> sensorDStream = ingoingStream.map(new Function<String, SensorData>() {
    public SensorData call(String json) throws Exception {
        ObjectMapper om = new ObjectMapper();
        return (SensorData) om.readValue(json, SensorData.class);
    }
}).cache();

/*********************************** MOVING AVERAGE OF TEMPERATURE ***********************************/

// Collect the data to a window of time (this is the time period for the average calculation, older data is removed from the stream!)
JavaDStream<SensorData> windowMovingAverageSensorDataTemp = sensorDStream.window(windowSizeMovingAverageTemperature);
windowMovingAverageSensorDataTemp.print();

// Map this SensorData stream to a new PairDStream with key = deviceId (so we can group the calculations by the id).
// .cache() the stream, because we re-use it more than once!
JavaPairDStream<String, SensorData> windowMovingAverageSensorDataTempPairDStream = windowMovingAverageSensorDataTemp
        .mapToPair(new PairFunction<SensorData, String, SensorData>() {
    public Tuple2<String, SensorData> call(SensorData data) throws Exception {
        return new Tuple2<String, SensorData>(data.getIdSensor(), data);
    }
}).cache();

// a) Map the PairDStream from above to a new PairDStream of form <deviceID, temperature>
// b) Sum up the values to the total sum, grouped by key (= device id)
// => the two transformations are combined here, they could also be called separately (like above)
JavaPairDStream<String, Float> windowMovingAverageSensorDataTempPairDStreamSum = windowMovingAverageSensorDataTempPairDStream
        .mapToPair(new PairFunction<Tuple2<String, SensorData>, String, Float>() {
    public Tuple2<String, Float> call(Tuple2<String, SensorData> sensorDataPair) throws Exception {
        String key = sensorDataPair._1();
        Float value = sensorDataPair._2().getValTemp();
        return new Tuple2<String, Float>(key, value);
    }
}).reduceByKey(new Function2<Float, Float, Float>() {
    public Float call(Float sumA, Float sumB) throws Exception {
        return sumA + sumB;
    }
});

// a) Map the PairDStream from above to a new PairDStream of form <deviceID, 1L> to prepare the counting (1 = 1 entry)
// b) Sum up the values to the total count of entries, grouped by key (= device id)
// => both calls combined here as well
JavaPairDStream<String, Long> windowMovingAverageSensorDataTempPairDStreamCount = windowMovingAverageSensorDataTempPairDStream
        .mapToPair(new PairFunction<Tuple2<String, SensorData>, String, Long>() {
    public Tuple2<String, Long> call(Tuple2<String, SensorData> sensorDataPair) throws Exception {
        String key = sensorDataPair._1();
        Long value = 1L;
        return new Tuple2<String, Long>(key, value);
    }
}).reduceByKey(new Function2<Long, Long, Long>() {
    public Long call(Long countA, Long countB) throws Exception {
        return countA + countB;
    }
});

// Join the sum and count streams, which puts together the data with the same keys (device id).
// This results in a new PairDStream of <deviceID, <sumOfTemp, countOfEntries>>
JavaPairDStream<String, Tuple2<Float, Long>> windowedTempJoinPairDStream =
        windowMovingAverageSensorDataTempPairDStreamSum.join(windowMovingAverageSensorDataTempPairDStreamCount).cache();

// Calculate the average temperature as avg = sumOfTemp / countOfEntries, for each key (device id)
JavaPairDStream<String, Float> windowedTempAvg = windowedTempJoinPairDStream
        .mapToPair(new PairFunction<Tuple2<String, Tuple2<Float, Long>>, String, Float>() {
    public Tuple2<String, Float> call(Tuple2<String, Tuple2<Float, Long>> joinedData) throws Exception {
        String key = joinedData._1();
        float tempSum = joinedData._2()._1();
        long count = joinedData._2()._2();
        float avg = tempSum / (float) count;
        return new Tuple2<String, Float>(key, avg);
    }
});

// print the joined PairDStream from above to check sum & count visually
windowedTempJoinPairDStream.print();

// print the final, calculated average values for each device id in form (deviceId, avgTemperature)
windowedTempAvg.print();

// ================================================ START THE STREAM ================================================

// Start streaming & listen until the stream is closed
streamingContext.start();
streamingContext.awaitTermination();

EDIT: Spark App using StatCounter for the average calculation:

I just changed my code to use StatCounter for the average calculation, but I still get different average values:

-------------------------------------------
Time: 1473077627000 ms
-------------------------------------------
(1-2-a-b-c,0.4779797872435302)

-------------------------------------------
Time: 1473077628000 ms
-------------------------------------------
(1-2-a-b-c,0.4779797872435303)

-------------------------------------------
Time: 1473077629000 ms
-------------------------------------------
(1-2-a-b-c,0.4779797872435301)

-------------------------------------------
Time: 1473077630000 ms
-------------------------------------------
(1-2-a-b-c,0.4779797872435302)

-------------------------------------------
Time: 1473077631000 ms
-------------------------------------------
(1-2-a-b-c,0.4779797872435301)

-------------------------------------------
Time: 1473077632000 ms
-------------------------------------------
(1-2-a-b-c,0.47797978724353024)

-------------------------------------------
Time: 1473077633000 ms
-------------------------------------------
(1-2-a-b-c,0.47797978724353013)
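
Here the new code snippet, as a minimal sketch of the StatCounter variant (it reuses SensorData and windowMovingAverageSensorDataTempPairDStream from the code above; treat it as an illustration of the approach, not a verbatim listing):

// requires: import org.apache.spark.util.StatCounter;

// Wrap each temperature in a StatCounter, then merge the counters per device id.
// StatCounter tracks count, mean and variance using a numerically stable online update.
JavaPairDStream<String, StatCounter> windowedTempStats = windowMovingAverageSensorDataTempPairDStream
        .mapValues(new Function<SensorData, StatCounter>() {
    public StatCounter call(SensorData data) throws Exception {
        return new StatCounter().merge(data.getValTemp());
    }
}).reduceByKey(new Function2<StatCounter, StatCounter, StatCounter>() {
    public StatCounter call(StatCounter a, StatCounter b) throws Exception {
        // merge mutates a and returns it; safe here because every counter
        // was freshly created in mapValues above
        return a.merge(b);
    }
});

// Extract the mean per device id (double precision), analogous to windowedTempAvg above
JavaPairDStream<String, Double> windowedTempAvgStat = windowedTempStats
        .mapValues(new Function<StatCounter, Double>() {
    public Double call(StatCounter stats) throws Exception {
        return stats.mean();
    }
});

windowedTempAvgStat.print();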

1 Answer:

Answer 0 (score: 1)

At least at first glance, this is not particularly strange. As you already suspected, it is most likely caused by rounding errors: floating-point addition is not associative, and Spark shuffles are non-deterministic, so the results can be expected to fluctuate between runs.
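
A small standalone illustration of the effect (the values are made up to resemble the readings above, and nothing here is Spark-specific):

public class FloatSumOrder {
    public static void main(String[] args) {
        // 100 nearly equal "temperature" readings
        float[] temps = new float[100];
        for (int i = 0; i < temps.length; i++) {
            temps[i] = 0.49159008f + i * 1e-7f;
        }
        float forward = 0f;
        float backward = 0f;
        for (int i = 0; i < temps.length; i++) forward += temps[i];
        for (int i = temps.length - 1; i >= 0; i--) backward += temps[i];
        // Float addition is not associative, so the two sums may disagree in
        // the last digits, just like the fluctuating (sum, count) pairs above.
        System.out.println(forward);
        System.out.println(backward);
    }
}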

How much you can do about it depends on your constraints:

  • For starters, computing the mean directly is not numerically stable. It is better to use o.a.s.util.StatCounter, which implements a variant of the online algorithm with better numerical properties.
  • If you can afford it, you can use arbitrary-precision numbers such as BigDecimal (see the sketch after this list).
  • Finally, forcing a deterministic summation order with a bit of repartitioning and secondary-sort magic can give consistent (though not necessarily exact) results.
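
For the BigDecimal option, the sum stage from the question could look roughly like this (a sketch reusing the stream and accessor names from the question's code):

// requires: import java.math.BigDecimal;

// Exact accumulation: BigDecimal addition is associative, so the result no
// longer depends on the order in which Spark merges the partial sums.
JavaPairDStream<String, BigDecimal> windowedTempExactSum = windowMovingAverageSensorDataTempPairDStream
        .mapValues(new Function<SensorData, BigDecimal>() {
    public BigDecimal call(SensorData data) throws Exception {
        // go through the String form so the float is captured exactly as printed
        return new BigDecimal(Float.toString(data.getValTemp()));
    }
}).reduceByKey(new Function2<BigDecimal, BigDecimal, BigDecimal>() {
    public BigDecimal call(BigDecimal a, BigDecimal b) throws Exception {
        return a.add(b);
    }
});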