Question

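I currently have a Structured Streaming DataFrame that aggregates timestamped records into hourly buckets (one-hour windows) and counts them. While it is nice to see the counts, my goal is to derive some statistics from the running query, such as the hour with the maximum count, the hour with the minimum count, the average number of timestamps per hour, the total count over 24 hours, and so on.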

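My question is how to access and query the running, active streaming DataFrame to produce this kind of information. Going one step further, I would like the maximum and minimum entries as whole rows (for example, the hour that produced the highest/lowest timestamp count together with the number of timestamps counted).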

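Ideally, I would like to display the growing per-hour timestamp counts from the streaming DataFrame and also report the max, min, etc. after each batch. However, if the max, min, etc. cannot be produced at the same time as the counts, I am completely fine with that.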

For reference, here is the sample code I am working with, along with some pseudocode showing what I am trying to do.

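import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{hour, window}
import org.apache.spark.sql.types.TimestampType

val spark = SparkSession.builder.appName("Sample").getOrCreate
import spark.implicits._
// Read the input directory as a stream of text lines, one file per trigger
val stream = spark.readStream.option("maxFilesPerTrigger", 1).text("file:///path/location")
// extract_udf is defined elsewhere (not shown); it pulls the timestamp out of each line
val extracted = stream.select(extract_udf($"value").cast(TimestampType) as "timestamp")
  .groupBy(window($"timestamp", "1 hour"), hour($"timestamp") as "hour")
  .count.sort($"window")
// Print the full, sorted table of hourly counts to the console after every batch
val query = extracted.writeStream.queryName("time_table").outputMode("complete")
  .format("console").start
query.awaitTermination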

Desired pseudocode

val max = findMaxRow(query) // Would extract from the row for a tuple (hour, count)
val min = // same as max
val runningCount = findTotalCount(query) // Would return the current count
val stats = List(max, min, runningCount, etc)
val statsDF = stats.toDF
// Code to display stats along with the queried DataFrame
statsDF.writeStream.... ???
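
To pin down what the helpers in the pseudocode above are meant to return, here is a minimal sketch on a plain Scala collection rather than on a streaming DataFrame. It only illustrates the target statistics (whole max/min rows, total, average per hour). The names `HourCount`, `Stats`, and `HourlyStats.summarize` are hypothetical, and the sample counts are made up for illustration.

```scala
// Hypothetical stand-in for one row of the hourly aggregation: (hour, count).
final case class HourCount(hour: Int, count: Long)

// The summary the pseudocode is after; all names here are illustrative.
final case class Stats(max: HourCount, min: HourCount, total: Long, avgPerHour: Double)

object HourlyStats {
  // Returns None for an empty snapshot, since max/min are undefined there.
  def summarize(rows: Seq[HourCount]): Option[Stats] =
    if (rows.isEmpty) None
    else {
      val total = rows.map(_.count).sum
      Some(Stats(
        max = rows.maxBy(_.count), // whole row, not just the count
        min = rows.minBy(_.count),
        total = total,
        avgPerHour = total.toDouble / rows.size
      ))
    }

  def main(args: Array[String]): Unit = {
    // Made-up counts, for illustration only.
    val snapshot = Seq(HourCount(9, 12L), HourCount(10, 30L), HourCount(11, 6L))
    println(summarize(snapshot))
  }
}
```

The same shape would apply to any static snapshot of the aggregation. How to obtain such a snapshot from a running query, and how to emit it alongside the streaming output, is what the question asks.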

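Before anyone mentions it: I have searched as much as I could and have not found an answer. This post is the closest thing I could find, but it does not address how to query an active streaming DataFrame, or how to have both the streaming DataFrame and a statistics DataFrame at the same time.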

Whether this simply cannot be done or a different approach is needed, any help would be appreciated.

Thanks!

Performing statistical queries on a Structured Streaming DataFrame

0 answers: