Spark Streaming tumbling windows at hourly intervals

Date: 2018-08-08 09:02:45

Tags: apache-spark spark-structured-streaming

A quick question about tumbling windows in Spark. Suppose I have some information about the sales of each shop:

val testShopStats = Seq(SalesStatsPerShop(1465430402000l, "2016-06-09 00:00:00", "2016-06-09 01:00:00", "NewYorker", 120.0, "Germany", "Munich"))

The timestamp 1465430402000 above corresponds to Thursday, June 9, 2016 00:00:02 (UTC).
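For context, SalesStatsPerShop is not shown in the question; a minimal sketch of what the case class presumably looks like, with field names guessed from the constructor arguments above and the columns used in the aggregation below (everything here is an assumption):

case class SalesStatsPerShop(
  timeStamp: Long,       // event time in epoch milliseconds (assumed; the aggregation below uses it as a time column)
  windowStart: String,   // guessed name for the first time-string argument
  windowEnd: String,     // guessed name for the second time-string argument
  shop: String,
  amount: Double,
  country: String,
  city: String
)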

I am trying to compute the average sales over different whole-hour intervals, like this:

import org.apache.spark.sql.functions.{avg, window}
// The $-column syntax assumes `import spark.implicits._` is in scope.

val preAggregatedAverage =
  lines
    .withWatermark("timeStamp", "2 hours")
    .groupBy($"shop", $"country", $"city", window($"timeStamp", "2 hours"))  // window duration varied as 1, 2, 4, 6 hours below
    .agg(avg($"amount") as "avgAmount")

I expected that whenever I specify a whole-hour window, the window would start at 00:00, but that is not always the case:

window = 1 hour
+---------+-------+------------------------------------------+
|shop     |country|window                                    |
+---------+-------+------------------------------------------+
|NewYorker|Germany|[2016-06-09 00:00:00, 2016-06-09 01:00:00]|
+---------+-------+------------------------------------------+

window = 2 hours
+---------+-------+------------------------------------------+
|shop     |country|window                                    |
+---------+-------+------------------------------------------+
|NewYorker|Germany|[2016-06-08 23:00:00, 2016-06-09 01:00:00]|
+---------+-------+------------------------------------------+

window = 4 hours
+---------+-------+------------------------------------------+
|shop     |country|window                                    |
+---------+-------+------------------------------------------+
|NewYorker|Germany|[2016-06-08 23:00:00, 2016-06-09 03:00:00]|
+---------+-------+------------------------------------------+

window = 6 hours
+---------+-------+------------------------------------------+
|shop     |country|window                                    |
+---------+-------+------------------------------------------+
|NewYorker|Germany|[2016-06-08 21:00:00, 2016-06-09 03:00:00]|
+---------+-------+------------------------------------------+
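The window assignment itself can be reproduced outside the stream; a minimal batch sketch using the same timestamp, assuming a local SparkSession (all names here are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder().master("local[*]").appName("window-check").getOrCreate()
import spark.implicits._

// A single row with the same event time as above (epoch millis cast to a timestamp).
val sample = Seq(1465430402000L).toDF("millis")
  .withColumn("timeStamp", (col("millis") / 1000).cast("timestamp"))

// Show which window each duration assigns to this one event; note that the
// displayed window bounds also depend on the session time zone.
for (d <- Seq("1 hour", "2 hours", "4 hours", "6 hours")) {
  sample.select(window(col("timeStamp"), d)).show(false)
}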

The question is: how is the window start date determined when whole-hour intervals are used?

Thanks

0 answers:

No answers