I am trying to aggregate a Spark structured stream on the timestamp, to get the per-second average of the incoming data for each device (source).
dataset.printSchema(); // see the output below
Dataset<Row> ds1 = dataset
        .withWatermark("timestamp", "1 second")
        .groupBy(
            functions.window(dataset.col("timestamp"), "1 second", "1 second"),
            dataset.col("source"))
        .agg(
            functions.avg("D0").as("AVG_D0"),
            functions.avg("I0").as("AVG_I0"))
        .orderBy("window");

StreamingQuery query = ds1.writeStream()
        .outputMode(OutputMode.Append())
        .format("console")
        .option("truncate", "false")
        .option("numRows", Integer.MAX_VALUE)
        .start();

query.awaitTermination();
I am using Spark 2.4.6.
According to
https://spark.apache.org/docs/2.4.6/structured-streaming-programming-guide.html#output-modes
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
the construct above should work.
However, start() throws an exception:
11:05:27.282 [main] ERROR my.sparkbench.example.Example - Exception
org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Sort [window#44 ASC NULLS FIRST], true
+- Aggregate [window#71, source#0], [window#71 AS window#44, source#0, avg(D0#12) AS AVG_D0#68, avg(I0#2L) AS AVG_I0#70]
+- Filter isnotnull(timestamp#1)
+- Project [named_struct(start, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / cast(1000000 as double))) as double) = (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / cast(1000000 as double))) THEN (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / cast(1000000 as double))) + cast(1 as bigint)) ELSE CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / cast(1000000 as double))) END + cast(0 as bigint)) - cast(1 as bigint)) * 1000000) + 0), LongType, TimestampType), end, precisetimestampconversion((((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / cast(1000000 as double))) as double) = (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / cast(1000000 as double))) THEN (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / cast(1000000 as double))) + cast(1 as bigint)) ELSE CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / cast(1000000 as double))) END + cast(0 as bigint)) - cast(1 as bigint)) * 1000000) + 0) + 1000000), LongType, TimestampType)) AS window#71, source#0, timestamp#1-T1000ms, I0#2L, I1#3L, I2#4L, I3#5L, I4#6L, I5#7L, I6#8L, I7#9L, I8#10L, I9#11L, D0#12, D1#13, D2#14, D3#15, D4#16, D5#17, D6#18, D7#19, D8#20, D9#21]
+- EventTimeWatermark timestamp#1: timestamp, interval 1 seconds
+- StreamingRelationV2 my.sparkbench.datastreamreader.MyStreamingSource@6897a4a, my.sparkbench.datastreamreader.MyStreamingSource, [source#0, timestamp#1, I0#2L, I1#3L, I2#4L, I3#5L, I4#6L, I5#7L, I6#8L, I7#9L, I8#10L, I9#11L, D0#12, D1#13, D2#14, D3#15, D4#16, D5#17, D6#18, D7#19, D8#20, D9#21]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:389)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:111)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:256)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:322)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:325)
at my.sparkbench.example.Example.streamGroupByResult(Example.java:113)
at my.sparkbench.example.Example.exec_main(Example.java:76)
at my.sparkbench.example.Example.do_main(Example.java:42)
at my.sparkbench.example.Example.main(Example.java:34)
The schema printout looks fine:
root
|-- source: string (nullable = false)
|-- timestamp: timestamp (nullable = false)
|-- I0: long (nullable = false)
|-- I1: long (nullable = false)
|-- I2: long (nullable = false)
|-- I3: long (nullable = false)
|-- I4: long (nullable = false)
|-- I5: long (nullable = false)
|-- I6: long (nullable = false)
|-- I7: long (nullable = false)
|-- I8: long (nullable = false)
|-- I9: long (nullable = false)
|-- D0: double (nullable = false)
|-- D1: double (nullable = false)
|-- D2: double (nullable = false)
|-- D3: double (nullable = false)
|-- D4: double (nullable = false)
|-- D5: double (nullable = false)
|-- D6: double (nullable = false)
|-- D7: double (nullable = false)
|-- D8: double (nullable = false)
|-- D9: double (nullable = false)
The actual data is fine too. If I feed it to
dataset.writeStream().format("console").option("truncate", "false").outputMode(OutputMode.Append()).start();
I get the output
-------------------------------------------
Batch: 0
-------------------------------------------
+--------+---------------------+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+
|source |timestamp |I0 |I1 |I2 |I3 |I4 |I5 |I6 |I7 |I8 |I9 |D0 |D1 |D2 |D3 |D4 |D5 |D6 |D7 |D8 |D9 |
+--------+---------------------+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+
|DEV-0001|1970-01-01 00:01:40 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0002|1970-01-01 00:01:40 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0003|1970-01-01 00:01:40 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0004|1970-01-01 00:01:40 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0001|1970-01-01 00:01:40.5|10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0002|1970-01-01 00:01:40.5|10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0003|1970-01-01 00:01:40.5|10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0004|1970-01-01 00:01:40.5|10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0001|1970-01-01 00:01:41 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0002|1970-01-01 00:01:41 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0003|1970-01-01 00:01:41 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0004|1970-01-01 00:01:41 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0001|1970-01-01 00:01:41.5|10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0002|1970-01-01 00:01:41.5|10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0003|1970-01-01 00:01:41.5|10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0004|1970-01-01 00:01:41.5|10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0001|1970-01-01 00:01:42 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0002|1970-01-01 00:01:42 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0003|1970-01-01 00:01:42 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
|DEV-0004|1970-01-01 00:01:42 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10 |10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|10.0|
+--------+---------------------+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+
only showing top 20 rows
followed by subsequent batches.
There is also no exception if I use the COMPLETE output mode (see the sketch below), but then every batch re-reports the old results from the very beginning of the timeline, which is not what I want. I only want the newly produced query result records to be reported. Hence I need the APPEND mode, but it causes the exception.
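For reference, this is roughly what the COMPLETE-mode variant I tried looks like (same ds1 as above, only the output mode differs); it starts without the exception but re-emits all windows on every batch:

// Same aggregation as above, only the output mode differs:
// starts fine, but each batch re-prints every window since the stream began.
StreamingQuery completeQuery = ds1.writeStream()
        .outputMode(OutputMode.Complete())
        .format("console")
        .option("truncate", "false")
        .option("numRows", Integer.MAX_VALUE)
        .start();
completeQuery.awaitTermination();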
Why the exception, and how can I make this work?
A mini-project reproducing the problem is here: https://github.com/oboguev/SparkQuestion
Thanks for your advice!