Interpreting timestamp fields in Spark when reading JSON

Date: 2018-12-09 22:39:06

Tags: json apache-spark timestamp

I am trying to read a pretty-printed JSON file that has a time field. I want the timestamp column to be interpreted as a timestamp type while reading the JSON itself. However, when I call printSchema, it is still read as a string.

For example, the input JSON file -

[{
    "time_field" : "2017-09-30 04:53:39.412496Z"
}]

Code -

df = spark.read.option("multiLine", "true").option("timestampFormat","yyyy-MM-dd HH:mm:ss.SSSSSS'Z'").json('path_to_json_file')

Output of df.printSchema() -

root
 |-- time_field: string (nullable = true)

What am I missing here?

2 answers:

Answer 0 (score: 1)

This is a bug in Spark version 2.4.0; see SPARK-26325.
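As a side check (plain Python with the standard library, not Spark), the sample value from the question does parse under an equivalent format pattern, which suggests the problem lies in Spark's JSON schema inference rather than in the timestamp string itself:

```python
from datetime import datetime

# The sample value from the question; "%f" accepts the 6-digit
# microsecond part, and the trailing "Z" is matched as a literal.
raw = "2017-09-30 04:53:39.412496Z"
ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%fZ")
print(ts.microsecond)  # 412496
```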

The following works in Spark version 2.4.4:

import org.apache.spark.sql.types.TimestampType
import spark.implicits._  // for toDF and $ (already in scope in spark-shell)

// Strings to timestamps
val df = Seq(("2019-07-01 12:01:19.000"),
  ("2019-06-24 12:01:19.000"),
  ("2019-11-16 16:44:55.406"),
  ("2019-11-16 16:50:59.406")).toDF("input_timestamp")

// Cast the string column to a proper timestamp column
val df_mod = df.select($"input_timestamp".cast(TimestampType))

df_mod.printSchema

Answer 1 (score: 0)

My experience with the timestampFormat option is that it does not quite work as advertised. I would simply read the time fields as strings and use to_timestamp to convert them, as shown below (with slightly generalized sample input):

# /path/to/jsonfile
[{
    "id": 101, "time_field": "2017-09-30 04:53:39.412496Z"
},
{
    "id": 102, "time_field": "2017-10-01 01:23:45.123456Z"
}]
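Since each record spans multiple lines, the multiLine option must stay enabled. As a quick structural check (plain-Python sketch with the standard library, not Spark), the file body parses as a single JSON array of two records:

```python
import json

# Same content as the sample file above, inlined for the check.
doc = """[{
    "id": 101, "time_field": "2017-09-30 04:53:39.412496Z"
},
{
    "id": 102, "time_field": "2017-10-01 01:23:45.123456Z"
}]"""

records = json.loads(doc)
print(len(records))      # 2
print(records[0]["id"])  # 101
```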

In Python:

from pyspark.sql.functions import to_timestamp

df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")

df = df.withColumn("timestamp", to_timestamp("time_field"))

df.show(2, False)
+---+---------------------------+-------------------+
|id |time_field                 |timestamp          |
+---+---------------------------+-------------------+
|101|2017-09-30 04:53:39.412496Z|2017-09-30 04:53:39|
|102|2017-10-01 01:23:45.123456Z|2017-10-01 01:23:45|
+---+---------------------------+-------------------+

df.printSchema()
root
 |-- id: long (nullable = true)
 |-- time_field: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

In Scala:

val df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")

df.withColumn("timestamp", to_timestamp($"time_field"))