I get an error when I apply a windowed aggregation on the result of mapGroupsWithState to compute counts over a few fields.
The input follows the schema below, where there can be many events with the same id but different timestamps and state values:
root
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- state: int (nullable = true)
For example:
event("abc", "a", 1, 1)
event("abc", "a", 2, 2)
event("def", "b", 1, 1)
event("def", "b", 2, 1)
event("ghi", "b", 1, 1)
By using mapGroupsWithState I keep only the event with the latest timestamp for each id. The result has the same schema, but there are no duplicate ids and each row holds the latest event (a sketch of this step is shown after the example output below):
root
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- state: int (nullable = true)
The result for the events above:
event("abc", "a", 2, 2)
event("def", "b", 2, 1)
event("ghi", "b", 1, 1)
Finally, I apply a windowed groupBy to count each distinct state per location, aiming for the following schema:
root
|-- location: string (nullable = true)
|-- state1: long (nullable = false)
|-- state2: long (nullable = false)
The query looks like this:
val aggDemand = df
.select($"id", $"location", $"timestamp", $"state")
.withWatermark("timestamp", "10 seconds")
.groupBy(functions.window($"timestamp", DataConstant.t15min.toString + " seconds", DataConstant.t1min.toString + " seconds"), $"location")
.agg(count(when($"state" === 1L, $"state")) as 'state1, count(when($"state" === 2L, $"state")) as 'state2)
.filter(unix_timestamp($"window.end".cast(TimestampType)) <= unix_timestamp(from_utc_timestamp(current_timestamp(), "UTC+08:00")) + DataConstant.t1min)
.filter(unix_timestamp($"window.end".cast(TimestampType)) > unix_timestamp(from_utc_timestamp(current_timestamp(), "UTC+08:00")))
.drop($"window")
When I run this against a streaming DataFrame/Dataset read from Kafka, I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: mapGroupsWithState is not supported with aggregation on a streaming DataFrame/Dataset;;
The goal is to get a result like this:
location | state1 | state2
--------------------------
a        |      0 |      1
b        |      2 |      0
This approach works in batch mode, but it does not seem to work as a streaming query. What is wrong with the query, and how can I achieve the desired result? Do I need to store the output of mapGroupsWithState somewhere before performing the windowed aggregation?
Any help is appreciated!
Answer 0 (score: 0)
Structured Streaming still has many restrictions; it is not a drop-in replacement for Spark Streaming.
In Spark Streaming you can solve this problem with the mapWithState function and get the same result.
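As a rough illustration only, a sketch with the DStream API, assuming an Event case class and a DStream[Event] named events parsed from the Kafka stream; keepLatest and latestPerId are hypothetical names:

import org.apache.spark.streaming.{State, StateSpec}

case class Event(id: String, location: String, timestamp: Long, state: Int)

// Keep only the newest event seen so far for each id.
val keepLatest = (id: String, incoming: Option[Event], saved: State[Event]) => {
  val newest = (incoming.toSeq ++ saved.getOption.toSeq).maxBy(_.timestamp)
  saved.update(newest)
  newest
}

val latestPerId = events          // events: DStream[Event] parsed from the Kafka stream (assumed)
  .map(e => (e.id, e))
  .mapWithState(StateSpec.function(keepLatest))

// latestPerId can then be fed into the usual DStream window operations
// to count states per location.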
Check this excerpt from Spark's UnsupportedOperationChecker, which is where the error is raised:
case m: FlatMapGroupsWithState if m.isStreaming =>
  // Check compatibility with output modes and aggregations in query
  val aggsAfterFlatMapGroups = collectStreamingAggregates(plan)

  if (m.isMapGroupsWithState) { // check mapGroupsWithState
    // allowed only in update query output mode and without aggregation
    if (aggsAfterFlatMapGroups.nonEmpty) {
      throwError(
        "mapGroupsWithState is not supported with aggregation " +
          "on a streaming DataFrame/Dataset")
    } else if (outputMode != InternalOutputModes.Update) {
      throwError(
        "mapGroupsWithState is not supported with " +
          s"$outputMode output mode on a streaming DataFrame/Dataset")
    }
  }