I want to use Spark Structured Streaming to compute, for each window, the average of the counts of similar systems (conceptually like a rolling average). However, Structured Streaming does not allow me to count the similar rows and then average over those counts; it fails with:
Multiple streaming aggregations are not supported with streaming DataFrames/Datasets
Input
+--------------------+--------------------+
| window| system|
+--------------------+--------------------+
|[2019-06-21 09:23...|A |
|[2019-06-21 09:23...|A |
|[2019-06-21 09:24...|A |
|[2019-06-21 09:24...|B |
|[2019-06-21 09:24...|B |
|[2019-06-21 09:25...|C |
+--------------------+--------------------+
Output
+--------------------+--------------------+-----+-----+
| window| system|count|avg |
+--------------------+--------------------+-----+-----+
|[2019-06-21 09:23...|A | 2| 2|
|[2019-06-21 09:24...|A | 1| 1.5|
|[2019-06-21 09:24...|B | 2| 2|
|[2019-06-21 09:25...|C | 1| 1|
+--------------------+--------------------+-----+-----+
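To make the expected output concrete: the avg column is a running mean of the per-window counts for each system (e.g. for A: (2+1)/2 = 1.5). A minimal plain-Python sketch of that logic, independent of Spark (the window labels here are shortened placeholders):

```python
from collections import defaultdict

# Per-window counts in window order, as produced by the first aggregation.
windowed_counts = [
    ("09:23", "A", 2),
    ("09:24", "A", 1),
    ("09:24", "B", 2),
    ("09:25", "C", 1),
]

# Running mean of the counts per system across successive windows.
totals = defaultdict(lambda: [0, 0])  # system -> [sum of counts, number of windows]
rows = []
for win, system, cnt in windowed_counts:
    totals[system][0] += cnt
    totals[system][1] += 1
    s, n = totals[system]
    rows.append((win, system, cnt, s / n))

for r in rows:
    print(r)  # e.g. ('09:24', 'A', 1, 1.5)
```

This is exactly the second aggregation (avg over count) that Structured Streaming rejects when stacked on top of the first one.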
I have also tried sinking the intermediate result to HDFS and reading it back to run the second aggregation separately (not my preferred solution, since it wastes a lot of time and storage), but even then there is a problem when I try to apply a watermark on "window.start".
Here is the schema:
root
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- system: string (nullable = true)
|-- count: long (nullable = true)
from pyspark.sql.functions import avg

avg_data = raw_data\
    .withWatermark("window.start", "1 minute")\
    .groupBy("system")\
    .agg(avg("count").alias("avg"))
An error occurred while calling o667.withWatermark.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
'EventTimeWatermark 'window.start, interval 1 minutes
Since I am doing this in Structured Streaming, window functions are not supported in this case either:
Non-time-based windows are not supported on streaming DataFrames/Datasets;
Here is the code where I group/count the similar systems:
from pyspark.sql.functions import window, count

groupped_data = raw_data\
    .withWatermark("timestamp", "1 minute")\
    .groupBy(window("timestamp", "1 minute", "1 minute"), "system")\
    .agg(count("system").alias("count"))
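For reference, the semantics of this first aggregation (1-minute tumbling windows, count per window and system) can be mimicked in plain Python. The event timestamps below are hypothetical, chosen only to reproduce the Input table above:

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events (timestamp, system), matching the Input table.
events = [
    (datetime(2019, 6, 21, 9, 23, 10), "A"),
    (datetime(2019, 6, 21, 9, 23, 40), "A"),
    (datetime(2019, 6, 21, 9, 24, 5), "A"),
    (datetime(2019, 6, 21, 9, 24, 20), "B"),
    (datetime(2019, 6, 21, 9, 24, 50), "B"),
    (datetime(2019, 6, 21, 9, 25, 15), "C"),
]

# A 1-minute tumbling window assigns each event to its truncated minute.
counts = Counter(
    (ts.replace(second=0, microsecond=0), system) for ts, system in events
)

for (win, system), cnt in sorted(counts.items()):
    print(win, system, cnt)
```

The open question is how to feed these per-window counts into a second, averaging aggregation within the same streaming query.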