I'm using Spark Structured Streaming with PySpark.
I have a string in the following format:
2020-04-21T11:28:40.321328+00:00
I need to change the date format to yyyy-MM-dd HH:mm:ss, and I'm trying this:
date_format(to_timestamp('value.Ticker.time', "yyyy-MM-dd'T'HH:mm:ss.sssssssZ"), "yyyy-MM-dd HH:mm:ss")
But the result is null.
My code is:
from pyspark.sql.functions import col, from_json, to_timestamp, date_format

BytesDF_Data_Level_2 = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "data_level_2") \
    .load()

StringDF_Data_Level_2 = BytesDF_Data_Level_2.selectExpr("CAST(value AS STRING)")
JsonDF_Data_Level_2 = StringDF_Data_Level_2.withColumn("value", from_json("value", schema_data_level_II))

JsonDF_cols_Data_Level_2 = JsonDF_Data_Level_2.select(
    #col('value.Ticker.contract.Forex.tradingClass'),
    col('value.Ticker.time'),
    date_format(to_timestamp('value.Ticker.time', "yyyy-MM-dd'T'HH:mm:ss.sssssssZ"), "yyyy-MM-dd HH:mm:ss")
    #col('value.Ticker.bid'),
    #col('value.Ticker.bidSize'),
    #col('value.Ticker.ask'),
    #col('value.Ticker.askSize')
)

query = JsonDF_cols_Data_Level_2 \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .start()

query.awaitTermination()
Thanks!

Answer 0 (score: 0)
Try the to_timestamp (or) from_unixtime(unix_timestamp()) functions with the format "yyyy-MM-dd'T'HH:mm:ss".
df.withColumn("new_time", to_timestamp(col("time"),"yyyy-MM-dd'T'HH:mm:ss")).show(10,False)
#+--------------------------------+-------------------+
#|time |new_time |
#+--------------------------------+-------------------+
#|2020-04-21T11:28:40.321328+00:00|2020-04-21 11:28:40|
#+--------------------------------+-------------------+
#using date_format
df.withColumn("new_time", date_format(to_timestamp(col("time"),"yyyy-MM-dd'T'HH:mm:ss"),"yyyy-MM-dd HH:mm:ss")).show(10,False)
#+--------------------------------+-------------------+
#|time |new_time |
#+--------------------------------+-------------------+
#|2020-04-21T11:28:40.321328+00:00|2020-04-21 11:28:40|
#+--------------------------------+-------------------+
#using from_unixtime, unix_timestamp functions
df.withColumn("new_time", from_unixtime(unix_timestamp(col("time"),"yyyy-MM-dd'T'HH:mm:ss"),"yyyy-MM-dd HH:mm:ss")).show(10,False)
#+--------------------------------+-------------------+
#|time |new_time |
#+--------------------------------+-------------------+
#|2020-04-21T11:28:40.321328+00:00|2020-04-21 11:28:40|
#+--------------------------------+-------------------+
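The expected output of all three variants above is the same conversion; as a quick sanity check of the target format outside Spark, the equivalent in plain Python's datetime module (Python 3.7+, not part of the streaming job) looks like this:

```python
from datetime import datetime

# Sample value in the same format as value.Ticker.time
s = "2020-04-21T11:28:40.321328+00:00"

# fromisoformat (Python 3.7+) handles the fractional seconds
# and the +00:00 offset directly
dt = datetime.fromisoformat(s)

# Re-format to the target yyyy-MM-dd HH:mm:ss layout
formatted = dt.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # 2020-04-21 11:28:40
```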
For Spark-3, the new datetime parser no longer ignores the trailing fraction and offset, so take the first 19 characters before parsing:

df.withColumn("new_time", to_timestamp((col('time').substr(1, 19)), "yyyy-MM-dd'T'HH:mm:ss")).show(10,False)