I'm grouping my data, sorting within each group, and keeping the top few rows of each group. I'm using a window function for this, but window functions don't work on streaming DataFrames. What should I do?
Before Aggregation:
+----------------+------------+----------+-----------+---------+----+--------------------+
| orderId|execQuantity|lastMarket| fundRef| symbol|side| event_time|
+----------------+------------+----------+-----------+---------+----+--------------------+
|dAA0556-20180122| 100.0| ITGC|BLBLK3_2822|895945103| BUY|2018-02-21 07:15:...|
|dAA0557-20180122| 60.0| ITGI|BLBLK3_2822|895945103| BUY|294247-01-10 12:0...|
|dAA0557-20180122| 200.0| XNYS|BLBLK3_2822|895945103| BUY|294247-01-10 12:0...|
|dAA0557-20180122| 50.0| JPMX|BLBLK3_2822|895945103| BUY|294247-01-10 12:0...|
|dAA0557-20180122| 30.0| BATS|BLBLK3_2822|895945103| BUY|294247-01-10 12:0...|
|dAA0557-20180122| 10.0| XNYS|BLBLK3_2822|895945103| BUY|294247-01-10 12:0...|
|dAA0557-20180122| 70.0| JPMX|BLBLK3_2822|895945103| BUY|294247-01-10 12:0...|
+----------------+------------+----------+-----------+---------+----+--------------------+
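For context, withTime is a streaming DataFrame. Roughly how it is built — the streaming file source, path, and schema below are simplified stand-ins for my actual job, not the real source:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("executions").getOrCreate()

// Schema matching the columns shown above.
val schema = new StructType()
  .add("orderId", StringType)
  .add("execQuantity", DoubleType)
  .add("lastMarket", StringType)
  .add("fundRef", StringType)
  .add("symbol", StringType)
  .add("side", StringType)
  .add("event_time", TimestampType)

// Hypothetical streaming file source; the real source differs.
val withTime = spark.readStream
  .schema(schema)
  .json("/path/to/executions")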
After aggregation:
import java.time.LocalDateTime

import org.apache.spark.sql.functions._

val executionDataQuery = withTime
  .selectExpr("orderId", "execQuantity", "lastMarket", "fundRef", "symbol", "side", "event_time")
  .withWatermark("event_time", "1 minute")
  .groupBy(col("symbol"), col("orderID"), col("lastMarket"))
  .sum("execQuantity")
  .writeStream
  .queryName("executionDataQuery")
  .format("memory")
  .outputMode("complete")
  .start()
println(LocalDateTime.now() + "\nSELECT symbol, orderID, lastMarket, sum(execQuantity) FROM executionDataQuery")
spark.sql("SELECT * FROM executionDataQuery").show
SELECT symbol, orderID, lastMarket, sum(execQuantity) FROM executionDataQuery
+---------+----------------+----------+-----------------+
| symbol| orderID|lastMarket|sum(execQuantity)|
+---------+----------------+----------+-----------------+
|895945103|dAA0557-20180122| ITGI| 60.0|
|895945103|dAA0557-20180122| XNYS| 210.0|
|895945103|dAA0557-20180122| JPMX| 120.0|
|895945103|dAA0557-20180122| BATS| 30.0|
|895945103|dAA0556-20180122| ITGC| 100.0|
+---------+----------------+----------+-----------------+
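For illustration, this is the top-N-per-group shape I'm ultimately after. Since spark.sql against the memory sink table runs as a batch query, the window function is accepted there (the top-3 cutoff and the subquery alias are just examples):

spark.sql("""
  SELECT * FROM (
    SELECT *,
           dense_rank() OVER (PARTITION BY symbol, orderID
                              ORDER BY `sum(execQuantity)` DESC) AS rnk
    FROM executionDataQuery
  ) ranked WHERE rnk <= 3
""").show()

But I want the ranking computed on the stream itself, not on a memory-sink snapshot.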
Aggregation with a window function:
import org.apache.spark.sql.expressions.Window

val windowSpec = Window
  .partitionBy(col("orderID"), col("lastMarket"), col("fundRef"), col("symbol"), col("side"))
  .orderBy(col("aggExecQuantity").desc)
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val executionDataQuery = withTime
.selectExpr("orderId", "execQuantity", "lastMarket", "fundRef", "symbol", "side", "event_time")
.withWatermark("event_time", "1 minute")
.groupBy(col("orderID"), col("lastMarket"), col("fundRef"), col("symbol"), col("side"))
.agg(expr("sum(execQuantity) as aggExecQuantity"))
val executionDataQuery1 = executionDataQuery.select(col("orderId"), col("lastMarket"), col("fundRef"), col("symbol"), col("side"),
  dense_rank().over(windowSpec).alias("aggExecQuantityRank"))
val executionDataQuery2 = executionDataQuery1.writeStream
  .queryName("executionDataQuery")
  .format("memory")
  .outputMode("complete")
  .start()
spark.sql("SELECT * FROM executionDataQuery").show
I get this error:

org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;
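Is foreachBatch (available since Spark 2.4) the right way around this? Each micro-batch is handed to the callback as a static DataFrame, so non-time-based windows should be allowed inside it. A rough sketch of what I have in mind, reusing windowSpec from above (the function name and the top-3 cutoff are mine):

import org.apache.spark.sql.DataFrame

// Rank the result table of each micro-batch and keep the top 3 rows per group.
// complete output mode so every batch carries the whole aggregate so far.
def rankTopN(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF
    .withColumn("aggExecQuantityRank", dense_rank().over(windowSpec))
    .where(col("aggExecQuantityRank") <= 3)
    .show()
}

val rankedQuery = executionDataQuery.writeStream
  .outputMode("complete")
  .foreachBatch(rankTopN _)
  .start()

Or is there a better pattern for top-N per group on a streaming DataFrame?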