I am trying to read pretty-printed JSON that contains a time field. I want the timestamp column to be interpreted as a timestamp field while reading the JSON itself, but when I call printSchema it comes back as a string.
For example, the input JSON file -
[{
"time_field" : "2017-09-30 04:53:39.412496Z"
}]
Code -
df = spark.read.option("multiLine", "true").option("timestampFormat","yyyy-MM-dd HH:mm:ss.SSSSSS'Z'").json('path_to_json_file')
df.printSchema()
Output -
root
|-- time_field: string (nullable = true)
What am I missing here?
Answer 0 (score: 1)
This is a bug in Spark version 2.4.0, see SPARK-26325. For Spark version 2.4.4:
import org.apache.spark.sql.types.TimestampType
import spark.implicits._  // assumes an active SparkSession named `spark` (available by default in spark-shell)

// Strings to timestamps
val df = Seq(("2019-07-01 12:01:19.000"),
             ("2019-06-24 12:01:19.000"),
             ("2019-11-16 16:44:55.406"),
             ("2019-11-16 16:50:59.406")).toDF("input_timestamp")
val df_mod = df.select($"input_timestamp".cast(TimestampType))
df_mod.printSchema
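For anyone working in PySpark (as in the question) rather than Scala, a rough equivalent of the same cast workaround is sketched below. This is not part of the original answer; it assumes an active SparkSession named spark and simply mirrors the sample data above.
# PySpark sketch of the cast workaround above (assumption: a SparkSession named `spark` exists)
from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

df = spark.createDataFrame(
    [("2019-07-01 12:01:19.000",),
     ("2019-06-24 12:01:19.000",),
     ("2019-11-16 16:44:55.406",),
     ("2019-11-16 16:50:59.406",)],
    ["input_timestamp"])
# Cast the string column to a proper timestamp and check the resulting schema
df_mod = df.select(col("input_timestamp").cast(TimestampType()))
df_mod.printSchema()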
Answer 1 (score: 0)
My experience with the timestampFormat option is that it does not quite work as advertised. I would simply read the time fields as strings and convert them using to_timestamp, as shown below (with slightly generalized sample input):
# /path/to/jsonfile
[{
"id": 101, "time_field": "2017-09-30 04:53:39.412496Z"
},
{
"id": 102, "time_field": "2017-10-01 01:23:45.123456Z"
}]
In Python:
from pyspark.sql.functions import to_timestamp
df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")
df = df.withColumn("timestamp", to_timestamp("time_field"))
df.show(2, False)
+---+---------------------------+-------------------+
|id |time_field |timestamp |
+---+---------------------------+-------------------+
|101|2017-09-30 04:53:39.412496Z|2017-09-30 04:53:39|
|102|2017-10-01 01:23:45.123456Z|2017-10-01 01:23:45|
+---+---------------------------+-------------------+
df.printSchema()
root
|-- id: long (nullable = true)
|-- time_field: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
In Scala:
import org.apache.spark.sql.functions.to_timestamp
import spark.implicits._  // needed for the $"time_field" syntax outside spark-shell
val df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")
df.withColumn("timestamp", to_timestamp($"time_field"))
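In either language, once time_field has been converted, the new timestamp column behaves like a real timestamp, so range filters and ordering work as expected. A short illustrative continuation of the Python example above (the cutoff date here is made up):
# Continues from the Python example above; the cutoff value is arbitrary
from pyspark.sql.functions import col, lit, to_timestamp

cutoff = to_timestamp(lit("2017-10-01 00:00:00"))
df.filter(col("timestamp") >= cutoff).orderBy("timestamp").show(truncate=False)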