Aggregating on the week start date (Monday) of a whole week

Using the window function in Spark, I cannot get the weekly aggregation to treat Monday as the start date of the week. Is there any other workaround?
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, sum

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
("001", "event1", 10, "2016-05-01 10:50:51"),
("002", "event2", 100, "2016-05-02 10:50:53"),
("001", "event3", 20, "2016-05-03 10:50:55"),
("010", "event3", 20, "2016-05-05 10:50:55"),
("001", "event1", 15, "2016-05-01 10:51:50"),
("003", "event1", 13, "2016-05-10 10:55:30"),
("001", "event2", 12, "2016-05-11 10:57:00"),
("001", "event3", 11, "2016-05-21 11:00:01"),
("002", "event2", 100, "2016-05-22 10:50:53"),
("001", "event3", 20, "2016-05-28 10:50:55"),
("001", "event1", 15, "2016-05-30 10:51:50"),
("003", "event1", 13, "2016-06-10 10:55:30"),
("001", "event2", 12, "2016-06-12 10:57:00"),
("001", "event3", 11, "2016-06-14 11:00:01")]).toDF("KEY", "Event_Type", "metric", "Time")
# Group into 7-day tumbling windows and aggregate within each window.
df2 = df.groupBy(window("Time", "7 day")) \
        .agg(sum("KEY").alias("aggregate_sum")) \
        .select("window.start", "window.end", "aggregate_sum") \
        .orderBy("window")
The expected output is one aggregated row per week, with each week starting on Monday. However, Spark's 7-day windows start on an arbitrary day rather than on Monday.
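For reference, a minimal sketch (reusing the DataFrame defined above) that prints where the default window boundaries fall; with the default alignment the windows are anchored to the Unix epoch, so the start dates are expected to be Thursdays such as 2016-04-28 rather than Mondays:

# Count events per default 7-day window and show the window boundaries;
# window starts are aligned to the Unix epoch (1970-01-01, a Thursday),
# so they fall on Thursdays (e.g. 2016-04-28, 2016-05-05, ...).
df.groupBy(window("Time", "7 day")) \
  .count() \
  .select("window.start", "window.end", "count") \
  .orderBy("window") \
  .show(truncate=False)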
Answer 0 (score: 4)
Windows are aligned to 1970-01-01 (a Thursday) by default. You can change them to start on Monday by using

window("Time", "7 day", startTime="4 days")

since four days after Thursday is Monday.
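A minimal sketch of the adjusted query, reusing the DataFrame and aggregation from the question; the only change is the startTime argument, which shifts the epoch-aligned Thursday boundaries to Mondays (e.g. 2016-05-02):

# Same aggregation as in the question, but with startTime="4 days" so that
# each 7-day window starts on a Monday instead of a Thursday.
df2 = df.groupBy(window("Time", "7 day", startTime="4 days")) \
        .agg(sum("KEY").alias("aggregate_sum")) \
        .select("window.start", "window.end", "aggregate_sum") \
        .orderBy("window")

df2.show(truncate=False)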