I'm using Spark 2.2 Structured Streaming to read CSV files (a sketch of how the input stream might be built follows the console output below). The query that writes the results to the console is:
val consoleQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.writeStream
.format("console")
.option("truncate", value = false)
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Complete())
The results look good:
+---------------------------------------------+-------------+-----+
|window |id |count|
+---------------------------------------------+-------------+-----+
|[2017-02-17 09:00:00.0,2017-02-17 10:00:00.0]|EXC0000000001|1 |
|[2017-02-17 09:00:00.0,2017-02-17 10:00:00.0]|EXC0000000002|8 |
|[2017-02-17 08:00:00.0,2017-02-17 09:00:00.0]|EXC2200002 |1 |
+---------------------------------------------+-------------+-----+
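As promised above, here is a minimal sketch of how the exceptions stream might be created; the schema, application name, and input path are assumptions rather than details from the original question:

import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder.appName("exceptions-demo").getOrCreate()
import spark.implicits._

// Streaming file sources require an explicit schema up front.
val schema = new StructType()
  .add("time", TimestampType) // event-time column used by withWatermark/window
  .add("id", StringType)

val exceptions = spark.readStream
  .schema(schema)
  .csv("src/main/resources/input") // hypothetical input directory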
But when writing the same aggregation to Parquet files:
val parquetQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.coalesce(1)
.writeStream
.format("parquet")
.option("path", "src/main/resources/parquet")
.trigger(Trigger.ProcessingTime(10.seconds))
.option("checkpointLocation", "src/main/resources/checkpoint")
.outputMode(OutputMode.Append())
and reading it back with another job,
val data = spark.read.parquet("src/main/resources/parquet/")
the result is as follows:
+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+
Answer (score: 0):
TL;DR parquetQuery has not been started, so there is no output from the streaming query.

Check out the type of parquetQuery: it is org.apache.spark.sql.streaming.DataStreamWriter, which is simply a description of a query that is supposed to be started at some point. Since it never was, the query was never able to write anything to the stream.
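To make the distinction concrete, a small sketch (df stands for any streaming Dataset; the paths are placeholders):

// A DataStreamWriter is only a description of the sink; nothing runs yet.
val writer = df.writeStream
  .format("parquet")
  .option("checkpointLocation", "out/checkpoint") // required by the file sink

// start kicks off the execution and returns a StreamingQuery handle.
val query = writer.start("out/parquet")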
Add start at the very end of the parquetQuery declaration (right after, or as part of, the call chain):
val parquetQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.coalesce(1)
.writeStream
.format("parquet")
.option("path", "src/main/resources/parquet")
.trigger(Trigger.ProcessingTime(10.seconds))
.option("checkpointLocation", "src/main/resources/checkpoint")
.outputMode(OutputMode.Append())
.start // <-- that's what was missing
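One addition beyond the original answer: start returns immediately and the query runs in the background, so block on the returned StreamingQuery if the driver would otherwise exit right away:

parquetQuery.awaitTermination() // keep the driver alive while the stream runs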