I use the PySpark sqlContext.read.parquet function every day to read parquet files. The data has a timestamp column. The timestamp field was changed from 2019-08-26T00:00:13.600+0000 to 2019-08-26T00:00:13.600Z. It reads fine in Databricks, but when I try to read it through my Spark cluster it throws an Illegal Parquet type: INT64 (TIMESTAMP_MICROS) error. How can I read this new column with the read.parquet function itself?

Currently I use from_unixtime(unix_timestamp(ts, "yyyy-MM-dd HH:mm:ss.SSS"), "yyyy-MM-dd") as ts to convert 2019-08-26T00:00:13.600+0000 into the 2019-08-26 format. How do I convert 2019-08-26T00:00:13.600Z to 2019-08-26?
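For reference, a minimal sketch of my current read-and-convert step (the path and column name are illustrative):

from pyspark.sql.functions import col, from_unixtime, unix_timestamp

# Read the daily parquet files (path is illustrative)
df = sqlContext.read.parquet("/path/to/data")

# Current conversion of the old +0000-suffixed strings to a plain date string
df = df.withColumn("ts", from_unixtime(unix_timestamp(col("ts"), "yyyy-MM-dd HH:mm:ss.SSS"), "yyyy-MM-dd"))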
Answer 0 (score: 0)
Here is the Scala version:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Sample data with the new "Z"-suffixed timestamp string
val df2 = Seq(("a3fac", "2019-08-26T00:00:13.600Z")).toDF("id", "eventTime")

// Parse the string to a timestamp, then truncate it to a date
val df3 = df2.withColumn("eventTime1", to_date(unix_timestamp($"eventTime", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").cast(TimestampType)))

df3.show(false)
+-----+------------------------+----------+
|id |eventTime |eventTime1|
+-----+------------------------+----------+
|a3fac|2019-08-26T00:00:13.600Z|2019-08-26|
+-----+------------------------+----------+
The following line converts the time-zoned timestamp string to a date:
to_date(unix_timestamp($"eventTime", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").cast(TimestampType))
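Note that unix_timestamp resolves only to whole seconds, so the .600 milliseconds are dropped along the way; since the final target is a plain date, that loss does not matter here.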
PySpark version:
>>> from pyspark.sql.functions import col, to_date, unix_timestamp
>>> df2 = spark.createDataFrame([("a3fac", "2019-08-26T00:00:13.600Z")], ['id', 'eventTime'])
>>> df3 = df2.withColumn("eventTime1", to_date(unix_timestamp(col("eventTime"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").cast('timestamp')))
>>> df3.show()
+-----+--------------------+----------+
| id| eventTime|eventTime1|
+-----+--------------------+----------+
|a3fac|2019-08-26T00:00:...|2019-08-26|
+-----+--------------------+----------+
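On Spark 2.2 and later, a slightly shorter variant is possible with to_timestamp, which parses the string straight to a timestamp without the unix_timestamp/cast round trip (a sketch continuing the same session):

>>> from pyspark.sql.functions import to_timestamp
>>> df2.withColumn("eventTime1", to_date(to_timestamp(col("eventTime"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))).show()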
Answer 1 (score: 0)
You can use the to_date API from the functions module:
import pyspark.sql.functions as f

# Sample row with the new "Z"-suffixed timestamp string
dfl2 = spark.createDataFrame([(1, "2019-08-26T00:00:13.600Z"),]).toDF('col1', 'ts')
dfl2.show(1, False)
+----+------------------------+
|col1|ts |
+----+------------------------+
|1 |2019-08-26T00:00:13.600Z|
+----+------------------------+
dfl2.withColumn('date',f.to_date('ts', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).show(1, False)
+----+------------------------+----------+
|col1|ts |date |
+----+------------------------+----------+
|1 |2019-08-26T00:00:13.600Z|2019-08-26|
+----+------------------------+----------+
dfl2.withColumn('date',f.to_date('ts', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).printSchema()
root
|-- col1: long (nullable = true)
|-- ts: string (nullable = true)
|-- date: date (nullable = true)
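If your files still contain a mix of the old +0000 suffix and the new Z suffix, one way to cover both during the migration is to try one pattern per format and keep whichever parses (a sketch; the offset pattern letter may need adjusting depending on your Spark version):

dfl2.withColumn('date', f.coalesce(
    f.to_date('ts', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),  # new form: literal Z suffix
    f.to_date('ts', "yyyy-MM-dd'T'HH:mm:ss.SSSZ")     # old form: +0000 zone offset
)).show(1, False)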