Question

In pyspark, local time can be obtained from UTC time by passing the timestamp and the time zone to the function from_utc_timestamp.

>>> from pyspark.sql.functions import from_utc_timestamp
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(from_utc_timestamp(df.t, "PST").alias('t')).collect()
[Row(t=datetime.datetime(1997, 2, 28, 2, 30))]

Here the time zone is provided as a string literal ("PST"). Suppose one has the following data structure:

+--------------------------+---------+
| utc_time                 |timezone |
+--------------------------+---------+
|  2018-08-03T23:27:30.000Z|  PST    |
|  2018-08-03T23:27:30.000Z|  GMT    |
|  2018-08-03T23:27:30.000Z|  SGT    |
+--------------------------+---------+

How can one obtain the following new column (preferably without a UDF)?

+--------------------------+---------+-------------------------+
| utc_time                 |timezone | local_time              |
+--------------------------+---------+-------------------------+
|  2018-08-03T23:27:30.000Z|  PST    | 2018-08-03T15:27:30.000 |
|  2018-08-03T23:27:30.000Z|  GMT    | 2018-08-04T00:27:30.000 |
|  2018-08-03T23:27:30.000Z|  SGT    | 2018-08-04T07:27:30.000 |
+--------------------------+---------+-------------------------+

Answer 1

Using pyspark.sql.functions.expr() rather than the DataFrame API, this can be achieved as follows:

import pyspark.sql.functions as F

df = df.select(
    '*',
    F.expr('from_utc_timestamp(utc_time, timezone)').alias("timestamp_local")
)

However, using 3-letter time zone IDs is not recommended. According to {{3}}:

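For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such as "PST", "CTT", "AST") are also supported. However, their use is deprecated because the same abbreviation is often used for multiple time zones (for example, "CST" could be U.S. "Central Standard Time" and "China Standard Time"), and the Java platform can then only recognize one of them.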
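
To illustrate the point about abbreviations, here is a minimal plain-Python sketch (no Spark involved) that maps each abbreviation from the question to a region-based ID and converts the UTC timestamp. The mapping ABBREVIATION_TO_REGION and the helper to_local are hypothetical names made up for this example, and the choice of region for each abbreviation is an assumption, not something the question specifies.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Hypothetical mapping: which region each abbreviation should mean is an
# assumption made for this sketch, not something defined by the question.
ABBREVIATION_TO_REGION = {
    "PST": "America/Los_Angeles",
    "GMT": "Etc/GMT",
    "SGT": "Asia/Singapore",
}


def to_local(utc_time: str, abbreviation: str) -> datetime:
    """Convert a UTC string like '2018-08-03T23:27:30.000Z' to naive local time."""
    # Parse the string and mark it explicitly as UTC.
    utc = datetime.strptime(utc_time, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )
    region = ZoneInfo(ABBREVIATION_TO_REGION[abbreviation])
    # Shift to the target region, then drop the tzinfo to get a wall-clock value.
    return utc.astimezone(region).replace(tzinfo=None)


if __name__ == "__main__":
    for abbreviation in ABBREVIATION_TO_REGION:
        local = to_local("2018-08-03T23:27:30.000Z", abbreviation)
        print(abbreviation, local.isoformat())
```

Note that with a region-based ID such as America/Los_Angeles, daylight saving time applies in August, so the first row comes out as 16:27:30 rather than the 15:27:30 a fixed UTC-8 offset would give. In Spark, a similar mapping could presumably be applied to the timezone column before it is passed to from_utc_timestamp.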
