Question

I want to use the to_timestamp function to format timestamps in pyspark. How can I do this without the time zone being changed or certain dates being dropped?

from pyspark.sql.types import StringType
from pyspark.sql.functions import col, udf, to_timestamp

date_format = "yyyy-MM-dd'T'HH:mm:ss"

vals = [('2018-03-11T02:39:00Z'), ('2018-03-11T01:39:00Z'), ('2018-03-11T03:39:00Z')]
testdf = spark.createDataFrame(vals, StringType())
testdf.withColumn("to_timestamp", to_timestamp("value",date_format)).show(4,False)


+--------------------+-------------------+                                      
|value               |to_timestamp       |
+--------------------+-------------------+
|2018-03-11T02:39:00Z|null               |
|2018-03-11T01:39:00Z|2018-03-11 01:39:00|
|2018-03-11T03:39:00Z|2018-03-11 03:39:00|
+--------------------+-------------------+

I expected 2018-03-11T02:39:00Z to be formatted correctly as 2018-03-11 02:39:00.

Then I switched to calling to_timestamp with its defaults.

testdf.withColumn("to_timestamp", to_timestamp("value")).show(4,False)

+--------------------+-------------------+
|value               |to_timestamp       |
+--------------------+-------------------+
|2018-03-11T02:39:00Z|2018-03-10 20:39:00|
|2018-03-11T01:39:00Z|2018-03-10 19:39:00|
|2018-03-11T03:39:00Z|2018-03-10 21:39:00|
+--------------------+-------------------+

Answer 1

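The time shift you see when calling to_timestamp() with its defaults occurs because your Spark instance is set to your local time zone rather than UTC. You can check this by running spark.conf.get('spark.sql.session.timeZone')

If you want timestamps displayed in UTC, set the conf value: spark.conf.set('spark.sql.session.timeZone', 'UTC')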

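Another point about your code: when you define the date format as "yyyy-MM-dd'T'HH:mm:ss", you are effectively asking Spark to ignore the time zone designator, so each string is parsed as a local time in the session time zone rather than as UTC/Zulu. That would explain the null if the session time zone is a US zone that observes daylight saving time (the six-hour shift in your second output suggests US Central): clocks jumped from 02:00 to 03:00 on 2018-03-11, so 02:39 does not exist there. The proper format would be date_format = "yyyy-MM-dd'T'HH:mm:ssXXX", but that is a moot point if you call to_timestamp() with its defaults.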
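The effect of dropping the zone designator can be illustrated without Spark. The sketch below uses only the Python standard library (Python 3.9+ with a time zone database available) and assumes the session time zone was America/Chicago, which is a guess based on the six-hour shift in the question's second output. The helper names are hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: the session time zone was a US zone such as America/Chicago.
LOCAL_TZ = ZoneInfo("America/Chicago")
UTC = timezone.utc


def exists_in_zone(naive: datetime, tz: ZoneInfo) -> bool:
    """Return True if the wall-clock time `naive` actually occurs in `tz`.

    A wall-clock time inside a daylight-saving gap does not survive a
    round trip through UTC: it comes back shifted forward.
    """
    aware = naive.replace(tzinfo=tz)
    round_trip = aware.astimezone(UTC).astimezone(tz)
    return round_trip.replace(tzinfo=None) == naive


def parse_ignoring_zone(text: str) -> datetime:
    """Parse the date and time fields only, discarding the trailing 'Z'."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


def parse_as_utc(text: str) -> datetime:
    """Parse the same string but honor the trailing 'Z' as UTC."""
    return parse_ignoring_zone(text).replace(tzinfo=UTC)


for text in ["2018-03-11T02:39:00Z", "2018-03-11T01:39:00Z", "2018-03-11T03:39:00Z"]:
    naive = parse_ignoring_zone(text)
    as_local = parse_as_utc(text).astimezone(LOCAL_TZ)
    # Show whether the zone-less reading is a valid local time, and what
    # the UTC reading looks like when displayed in the local zone.
    print(text, exists_in_zone(naive, LOCAL_TZ), as_local.strftime("%Y-%m-%d %H:%M:%S"))
```

Under that assumption, only the 02:39 value is reported as nonexistent when read as a local time, while reading all three as UTC and displaying them locally reproduces the shifted values from the question's second output.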

Answer 2

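Use the from_utc_timestamp method to treat the input column value as a UTC timestamp.

from pyspark.sql.functions import from_utc_timestamp
testdf.withColumn("to_timestamp", from_utc_timestamp("value", "UTC")).show(4,False)  # the tz argument is required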

pyspark to_timestamp function does not convert certain timestamps
