How does Spark determine the window.start of the first window when grouping by a time window?

Asked: 2018-07-27 19:27:14

Tags: apache-spark apache-spark-sql spark-streaming

Sample data:

scala> purchases.show(false)
+---------+-------------------+--------+
|client_id|transaction_ts     |store_id|
+---------+-------------------+--------+
|1        |2018-06-01 12:17:37|1       |
|1        |2018-06-02 13:17:37|2       |
|1        |2018-06-03 14:17:37|3       |
|1        |2018-06-09 10:17:37|2       |
|2        |2018-06-02 10:17:37|1       |
|2        |2018-06-02 13:17:37|2       |
|2        |2018-06-08 14:19:37|3       |
|2        |2018-06-16 13:17:37|2       |
|2        |2018-06-17 14:17:37|3       |
+---------+-------------------+--------+

When I group by a time window:

scala> purchases.groupBy($"client_id", window($"transaction_ts", "8 days")).count.orderBy("client_id", "window.start").show(false)

+---------+---------------------------------------------+-----+                 
|client_id|window                                       |count|
+---------+---------------------------------------------+-----+
|1        |[2018-05-28 17:00:00.0,2018-06-05 17:00:00.0]|3    |
|1        |[2018-06-05 17:00:00.0,2018-06-13 17:00:00.0]|1    |
|2        |[2018-05-28 17:00:00.0,2018-06-05 17:00:00.0]|2    |
|2        |[2018-06-05 17:00:00.0,2018-06-13 17:00:00.0]|1    |
|2        |[2018-06-13 17:00:00.0,2018-06-21 17:00:00.0]|2    |
+---------+---------------------------------------------+-----+

I would like to know why the first window.start is 2018-05-28 17:00:00.0 when the minimum value in the data is 2018-06-01 12:17:37.

How does Spark compute the time windows? I expected the minimum timestamp to be used as the first window.start...

1 Answer:

Answer 0 (score: 0)

Thanks @user8371915!

After going through the suggested link I found the answer I was looking for, in particular: even though there is no data anywhere near 1970-01-01, Spark generates windows aligned to 1970-01-01 00:00:00 UTC rather than to the earliest timestamp in the data, and that alignment is what determines window.start. See What does the 'pyspark.sql.functions.window' function's 'startTime' argument do? for more details.
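
For reference, here is a minimal sketch of that behaviour in the spark-shell. The session timezone is assumed to be UTC-7 (which would explain why the epoch-aligned boundaries display as 17:00:00), and the "3 days" startTime offset below is only illustrative:

import org.apache.spark.sql.functions.window

// Window starts are aligned to 1970-01-01 00:00:00 UTC, not to min(transaction_ts):
// 2018-05-29 00:00:00 UTC is exactly 17680 days = 2210 * 8 days after the epoch,
// so it shows up as 2018-05-28 17:00:00 in a UTC-7 session timezone.

// Default alignment -- what the query above observed.
purchases
  .groupBy($"client_id", window($"transaction_ts", "8 days"))
  .count()

// The four-argument form window(timeColumn, windowDuration, slideDuration, startTime)
// shifts that alignment. Keeping slideDuration equal to windowDuration keeps the
// windows tumbling, and a "3 days" startTime moves every boundary forward by 3 days,
// so under the same timezone assumption the first window would start at
// 2018-05-31 17:00:00 instead.
purchases
  .groupBy($"client_id", window($"transaction_ts", "8 days", "8 days", "3 days"))
  .count()

In other words, window.start comes from this epoch-based alignment (plus startTime), which is why it can fall well before the earliest timestamp in the data.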