Question

My DataFrame looks like this:

+----------------+-------------+
|   Business_Date|         Code|
+----------------+-------------+
|1539129600000000|          BSD|
|1539129600000000|          BTN|
|1539129600000000|          BVI|
|1539129600000000|          BWP|
|1539129600000000|          BYB|
+----------------+-------------+

I want to convert the Business_Date column from bigint to a timestamp value while loading the data into a Hive table.

How can I do this?

Answer 1

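You can use pyspark.sql.functions.from_unixtime(), which converts a number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) into a string representing the timestamp of that moment in the current system time zone, in the given format.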

Your Business_Date values appear to be in microseconds, so they need to be divided by 1,000,000 to get seconds.
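
As a quick sanity check on that unit, here is a minimal sketch in plain Python (no Spark involved) that treats the sample value as microseconds since the epoch. The helper name `micros_to_utc` is made up for illustration, and the microsecond unit is an assumption inferred from the magnitude of the values.

```python
from datetime import datetime, timezone

MICROS_PER_SECOND = 1_000_000

def micros_to_utc(micros: int) -> datetime:
    """Interpret an integer count of microseconds since the Unix epoch as a UTC datetime."""
    # Split into whole seconds and the leftover microseconds so nothing is lost to float rounding.
    seconds, remainder = divmod(micros, MICROS_PER_SECOND)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=remainder)

business_date = 1539129600000000  # sample value from the question

print(business_date // MICROS_PER_SECOND)        # 1539129600
print(micros_to_utc(business_date).isoformat())  # 2018-10-10T00:00:00+00:00
```

The result lands exactly on midnight UTC, which is what you would expect from a business date, so the microsecond interpretation looks plausible.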

For example:

from pyspark.sql.functions import from_unixtime, col

df = df.withColumn(
    "Business_Date",
    from_unixtime(col("Business_Date")/1000000).cast("timestamp")
)
df.show()
#+---------------------+----+
#|Business_Date        |Code|
#+---------------------+----+
#|2018-10-09 20:00:00.0|BSD |
#|2018-10-09 20:00:00.0|BTN |
#|2018-10-09 20:00:00.0|BVI |
#|2018-10-09 20:00:00.0|BWP |
#|2018-10-09 20:00:00.0|BYB |
#+---------------------+----+
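
The 20:00:00 in the output above comes from the time zone in which the result is rendered, not from the data itself. The following plain-Python sketch shows the same instant formatted in two zones. The UTC-4 offset is an assumption chosen because it reproduces the output shown; the original answer does not state which zone was used, and `render` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# The sample value, already divided down to seconds.
instant = datetime.fromtimestamp(1539129600, tz=timezone.utc)

# Assumption: the session that produced the output above was running at UTC-4.
assumed_zone = timezone(timedelta(hours=-4))

def render(moment: datetime, zone: timezone) -> str:
    """Format one instant as it would appear on a wall clock in the given zone."""
    return moment.astimezone(zone).strftime("%Y-%m-%d %H:%M:%S")

print(render(instant, timezone.utc))  # 2018-10-10 00:00:00
print(render(instant, assumed_zone))  # 2018-10-09 20:00:00
```

If the displayed date needs to be the UTC one, it may be worth checking the Spark session time zone setting (`spark.sql.session.timeZone`) before writing to the table.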

from_unixtime returns a string, which is why the result is cast to timestamp.

The schema is now:

df.printSchema()
#root
# |-- Business_Date: timestamp (nullable = true)
# |-- Code: string (nullable = true)
