Question

My data looks like this:

{"id":1,"createdAt":"2016-07-01T16:37:41-0400"}
{"id":2,"createdAt":"2016-07-01T16:37:41-0700"}
{"id":3,"createdAt":"2016-07-01T16:37:41-0400"}
{"id":4,"createdAt":"2016-07-01T16:37:41-0700"}
{"id":5,"createdAt":"2016-07-06T09:48Z"}
{"id":6,"createdAt":"2016-07-06T09:48Z"}
{"id":7,"createdAt":"2016-07-06T09:48Z"}

I am converting the createdAt field to a timestamp as shown below.

from pyspark.sql import SQLContext
from pyspark.sql.functions import *

sqlContext = SQLContext(sc)
df = sqlContext.read.json('data/test.json')
dfProcessed = df.withColumn('createdAt', df.createdAt.cast('timestamp'))

dfProcessed.printSchema()
dfProcessed.collect()

The output I get is shown below. Every createdAt value comes back as None. What can I do to retrieve the field as a proper timestamp?

root
 |-- createdAt: timestamp (nullable = true)
 |-- id: long (nullable = true)

[Row(createdAt=None, id=1),
 Row(createdAt=None, id=2),
 Row(createdAt=None, id=3),
 Row(createdAt=None, id=4),
 Row(createdAt=None, id=5),
 Row(createdAt=None, id=6),
 Row(createdAt=None, id=7)]

Answer 1

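A plain cast from a string column to a timestamp only works when the strings are already in a format the cast understands.

To retrieve the "createdAt" column as a timestamp, you can write a UDF that converts the string

"2016-07-01T16:37:41-0400"

to

"2016-07-01 16:37:41"

and apply it to the "createdAt" column to convert it to the new format (don't forget to handle the time zone field).

Once you have a column that holds the timestamps as strings such as "2016-07-01 16:37:41", a simple cast to timestamp will do the job, just as in your code.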
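The string conversion described above could be sketched in plain Python as follows. This is only an illustration under stated assumptions: the name `normalize_created_at` is hypothetical, only the two layouts present in the sample data are handled, and the offset is resolved by converting to UTC. That means the result for the first sample is "2016-07-01 20:37:41" rather than the offset-stripped "2016-07-01 16:37:41", and it presumes the Spark session time zone is UTC when the string is later cast.

```python
from datetime import datetime, timezone

# Layouts seen in the sample data (assumption: no other layouts occur).
_FORMATS = ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M%z")


def normalize_created_at(value):
    """Return value as 'yyyy-MM-dd HH:mm:ss' in UTC, or None if unparseable."""
    if value is None:
        return None
    # Spell a trailing 'Z' out as a numeric offset so %z accepts it
    # on every Python 3 version.
    text = (value[:-1] + "+0000") if value.endswith("Z") else value
    for fmt in _FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next layout
        # Handle the time zone field by shifting to UTC before dropping it.
        return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return None
```

In PySpark, a function like this could be wrapped with `udf(normalize_created_at, StringType())`, applied to the createdAt column, and the result cast to timestamp as in the question's code.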

You can read more about date/time/string handling in Spark here.
