Question

示例如下：

df=spark.createDataFrame([
    (1,"2017-05-15 23:12:26",2.5),
    (1,"2017-05-09 15:26:58",3.5),
    (1,"2017-05-18 15:26:58",3.6),
    (2,"2017-05-15 15:24:25",4.8),
    (3,"2017-05-25 15:14:12",4.6)],["index","time","val"]).orderBy("index","time")
df.collect()

+-----+-------------------+---+
|index|               time|val|
+-----+-------------------+---+
|    1|2017-05-09 15:26:58|3.5|
|    1|2017-05-15 23:12:26|2.5|
|    1|2017-05-18 15:26:58|3.6|
|    2|2017-05-15 15:24:25|4.8|
|    3|2017-05-25 15:14:12|4.6|
+-----+-------------------+---+

用于函数“pyspark.sql.functions”

window(timeColumn, windowDuration, slideDuration=None, startTime=None)

timeColumn：The time column must be of TimestampType.

windowDuration：  Durations are provided as strings, e.g. '1 second', '1 day 12 hours', '2 minutes'. Valid
interval strings are 'week', 'day', 'hour', 'minute', 'second', 'millisecond', 'microsecond'.

slideDuration: If the 'slideDuration' is not provided, the windows will be tumbling windows.

startTime： the startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide `startTime` as `15 minutes`.

我想每隔5天在此函数中计算参数“val”，并将参数“slideDuration”设置为“5天”的字符串值

timeColumn="time",windowDuration="5 day",slideDuration="5 day"

代码如下：

df2=df.groupBy("index",F.window("time",windowDuration="5 day",slideDuration="5 day")).agg(F.sum("val").alias("sum_val"))

当我得到参数“window.start”的值时，时间没有从我在“时间”列中给出的最小时间或我之前设置的时间开始，但是从no到其中。

结果如下：

+-----+---------------------+---------------------+-------+
|index|start                |end                  |sum_val|
+-----+---------------------+---------------------+-------+
|1    |2017-05-09 08:00:00.0|2017-05-14 08:00:00.0|3.5    |
|1    |2017-05-14 08:00:00.0|2017-05-19 08:00:00.0|6.1    |
|2    |2017-05-14 08:00:00.0|2017-05-19 08:00:00.0|4.8    |
|3    |2017-05-24 08:00:00.0|2017-05-29 08:00:00.0|4.6    |
+-----+---------------------+---------------------+-------+

当我为参数“startTime”设置一个值为'0秒'时（代码如下）：

df2=df.groupBy("index",F.window("time",windowDuration="5 day",slideDuration="5 day",startTime="0 second")).agg(F.sum("val").alias("sum_val"))

+-----+---------------------+---------------------+-------+
|index|start                |end                  |sum_val|
+-----+---------------------+---------------------+-------+
|1    |2017-05-09 08:00:00.0|2017-05-14 08:00:00.0|3.5    |
|1    |2017-05-14 08:00:00.0|2017-05-19 08:00:00.0|6.1    |
|2    |2017-05-14 08:00:00.0|2017-05-19 08:00:00.0|4.8    |
|3    |2017-05-24 08:00:00.0|2017-05-29 08:00:00.0|4.6    |
+-----+---------------------+---------------------+-------+

结果表明，它仍然没有以“时间”栏中的最短时间开始

那么我该如何让这个功能以“时间”栏中的最短时间开始，或者我第一次设定的时间，例如“2017-05-09 15:25:30”，我是非常感谢你让我解决这个问题

官方介绍'startTime'如下

The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. 
For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15...
provide `startTime` as `15 minutes`.

参考文献如下

1。What does the 'pyspark.sql.functions.window' function's 'startTime' argument do?

2。https://github.com/apache/spark/pull/12008

3。http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.functions.window

Answer 1

您遇到的问题与startTime完全无关，并且有两个组成部分：

Spark的timestamp semantics，其中时间戳总是根据本地时区处理。根据输出中显示的偏移量，我们得出结论，JVM使用GMT + 8或等效时区。请考虑以下两种情况：

>>> from pyspark.sql.functions import window
>>>
>>> spark.conf.get("spark.driver.extraJavaOptions")
'-Duser.timezone=GMT+8'
>>> spark.conf.get("spark.executor.extraJavaOptions")
'-Duser.timezone=GMT+8'
>>> str(spark.sparkContext._jvm.java.util.TimeZone.getDefault())
'sun.util.calendar.ZoneInfo[id="GMT+08:00",offset=28800000,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]'
>>>
>>> df = spark.createDataFrame([(1,"2017-05-15 23:12:26",2.5)], ["index","time","val"])
>>> (df
...     .withColumn("w", window("time" ,windowDuration="5 days" ,slideDuration="5 days"))
...     .show(1, False))
...     
+-----+-------------------+---+---------------------------------------------+
|index|time               |val|w                                            |
+-----+-------------------+---+---------------------------------------------+
|1    |2017-05-15 23:12:26|2.5|[2017-05-14 08:00:00.0,2017-05-19 08:00:00.0]|
+-----+-------------------+---+---------------------------------------------+

VS

>>> from pyspark.sql.functions import window
>>>
>>> spark.conf.get("spark.driver.extraJavaOptions")
'-Duser.timezone=UTC'
>>> spark.conf.get("spark.executor.extraJavaOptions")
'-Duser.timezone=UTC'
>>> str(spark.sparkContext._jvm.java.util.TimeZone.getDefault())
'sun.util.calendar.ZoneInfo[id="UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]'
>>>
>>> df = spark.createDataFrame([(1,"2017-05-15 23:12:26",2.5)], ["index","time","val"])
>>> (df
...     .withColumn("w", window("time" ,windowDuration="5 days" ,slideDuration="5 days"))
...     .show(1, False))
... 
+-----+-------------------+---+---------------------------------------------+
|index|time               |val|w                                            |
+-----+-------------------+---+---------------------------------------------+
|1    |2017-05-15 23:12:26|2.5|[2017-05-14 00:00:00.0,2017-05-19 00:00:00.0]|
+-----+-------------------+---+---------------------------------------------+

正如您所见，输出是根据本地时区调整的，而输入字符串则被解析为UTC时间戳。

window语义。如果你看一下执行计划

>>> df.withColumn("w", window("time",windowDuration="5 days",slideDuration="5 days")).explain(False)
== Physical Plan ==
*Project [index#21L, time#22, val#23, window#68 AS w#67]
+- *Filter (((isnotnull(time#22) && isnotnull(window#68)) && (cast(time#22 as timestamp) >= window#68.start)) && (cast(time#22 as timestamp) < window#68.end))
   +- *Expand [List(named_struct(start, ((((CEIL((cast((precisetimestamp(cast(time#22 as timestamp)) - 0) as double) / 4.32E11)) + 0) - 1) * 432000000000) + 0), end, ((((CEIL((cast((precisetimestamp(cast(time#22 as timestamp)) - 0) as double) / 4.32E11)) + 0) - 1) * 432000000000) + 432000000000)), index#21L, time#22, val#23), List(named_struct(start, ((((CEIL((cast((precisetimestamp(cast(time#22 as timestamp)) - 0) as double) / 4.32E11)) + 1) - 1) * 432000000000) + 0), end, ((((CEIL((cast((precisetimestamp(cast(time#22 as timestamp)) - 0) as double) / 4.32E11)) + 1) - 1) * 432000000000) + 432000000000)), index#21L, time#22, val#23)], [window#68, index#21L, time#22, val#23]
      +- Scan ExistingRDD[index#21L,time#22,val#23]

并专注于单一组件：

((((CEIL((cast((precisetimestamp(cast(time#22 as timestamp)) - 0) as double) / 4.32E11)) + 0) - 1) * 432000000000)

你会看到该窗口采用数值上限，有效地将时间戳截断为整个时间间隔。

最后{/ 1}}

startTime

完全没有效果，因为它表现得像默认（没有偏移）。如果有什么可以尝试：

df.groupBy("index",F.window("time",windowDuration="5 day",slideDuration="5  day",startTime="0 second"))

(startTime, ) = (df
    .select(min_(col("time").cast("timestamp")).alias("ts"))
    .select(
       ((col("ts").cast("double") - 
       col("ts").cast("date").cast("timestamp").cast("double")
      ) * 1000).cast("integer"))
     .first())

w = window(
    "time", 
    windowDuration="5 days",
    slideDuration="5 days",
    startTime="{} milliseconds".format(startTime))


df.withColumn("w", w).show(1, False)

pyspark.sql.functions.window函数的'startTime'参数和window.start有什么关系？

参考文献如下

1 个答案: