A quick question about Spark tumbling windows. Suppose I have some sales information for each shop:
val testShopStats = Seq(SalesStatsPerShop(1465430402000L, "2016-06-09 00:00:00", "2016-06-09 01:00:00", "NewYorker", 120.0, "Germany", "Munich"))
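The timestamp 1465430402000 above corresponds to Thursday, 9 June 2016, 00:00:02 UTC.
I am trying to compute the average sales amount over windows of different lengths (in hours), like this: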
val preAggregatedAverage =
  lines
    .withWatermark("timeStamp", "2 hours")
    .groupBy($"shop", $"country", $"city", window($"timeStamp", "2 hours"))
    .agg(avg($"amount") as "avgAmount")
I expected the windows to start at 00:00 whatever window length I specify, but that is not always the case:
window = 1 hour
+---------+-------+------------------------------------------+
|shop |country|window |
+---------+-------+------------------------------------------+
|NewYorker|Germany|[2016-06-09 00:00:00, 2016-06-09 01:00:00]|
+---------+-------+------------------------------------------+
window = 2 hours
+---------+-------+------------------------------------------+
|shop |country|window |
+---------+-------+------------------------------------------+
|NewYorker|Germany|[2016-06-08 23:00:00, 2016-06-09 01:00:00]|
+---------+-------+------------------------------------------+
window = 4 hours
+---------+-------+------------------------------------------+
|shop |country|window |
+---------+-------+------------------------------------------+
|NewYorker|Germany|[2016-06-08 23:00:00, 2016-06-09 03:00:00]|
+---------+-------+------------------------------------------+
window = 6 hours
+---------+-------+------------------------------------------+
|shop |country|window |
+---------+-------+------------------------------------------+
|NewYorker|Germany|[2016-06-08 21:00:00, 2016-06-09 03:00:00]|
+---------+-------+------------------------------------------+
The question is: for a given window length in hours, how is the window start time determined?
Thanks
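For context, here is a minimal sketch of the arithmetic I would expect to be behind this. It assumes tumbling windows are aligned to the Unix epoch (1970-01-01 00:00:00 UTC) plus an optional offset, which is how the Spark documentation describes the `startTime` parameter of `window`. The object `WindowStartSketch` and the helper `windowStart` are hypothetical names for illustration only, not Spark's actual implementation.

```scala
import java.time.{Duration, Instant}

object WindowStartSketch {
  // Start (epoch millis) of the tumbling window that contains tsMs.
  // Assumption: windows are aligned to the Unix epoch, shifted by startTimeMs.
  def windowStart(tsMs: Long, windowMs: Long, startTimeMs: Long = 0L): Long =
    tsMs - Math.floorMod(tsMs - startTimeMs, windowMs)

  def main(args: Array[String]): Unit = {
    val ts = 1465430402000L // the event timestamp from the question
    for (hours <- Seq(1, 2, 4, 6)) {
      val windowMs = Duration.ofHours(hours).toMillis
      val start = Instant.ofEpochMilli(windowStart(ts, windowMs))
      val end = start.plusMillis(windowMs)
      // Instant prints in UTC; a session time zone would shift what is displayed.
      println(s"$hours h: [$start, $end)")
    }
  }
}
```

Under that assumption, with a zero offset and this timestamp, all four window lengths start at midnight UTC, because midnight UTC is a whole multiple of 1, 2, 4 and 6 hours from the epoch. The boundaries shown in my output would then differ only because Spark renders them in the session time zone, though I have not confirmed that. If epoch alignment is indeed the cause, the four-argument overload `window(timeColumn, windowDuration, slideDuration, startTime)` in `org.apache.spark.sql.functions` looks like the way to shift the start, for example by passing an offset such as "1 hour".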