I am working with data that has nanosecond-precision timestamps and I am trying to convert the string into a timestamp format.
The "Time" column looks like this:
+---------------+
| Time |
+---------------+
|091940731349000|
|092955002327000|
|092955004088000|
+---------------+
I want to cast it to:
+------------------+
| Timestamp |
+------------------+
|09:19:40.731349000|
|09:29:55.002327000|
|09:29:55.004088000|
+------------------+
From what I have found online, I should not need a udf to do this, and there should be a native function I can use.
I have tried cast and to_timestamp, but I got null values:
df_new = df.withColumn('Timestamp', df.Time.cast("timestamp"))
df_new.select('Timestamp').show()
+---------+
|Timestamp|
+---------+
| null|
| null|
+---------+
Answer 0 (score: 3)
There are two problems with your code:

1. The input string is not a valid timestamp representation, so a bare cast (or to_timestamp without a matching pattern) has nothing to parse and returns null.
2. Spark's TimestampType always carries a date component and stores at most microsecond precision, so the exact time-only, nanosecond output you are after cannot be represented as a timestamp.
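As a minimal sketch of both points (plain PySpark, assuming an active SparkSession named spark; nothing here is specific to the code further down), you can cast two literals directly: the bare digit string comes back null, and a well-formed literal parses but keeps only microseconds:

from pyspark.sql.functions import lit

spark.range(1).select(
    # Not a recognized timestamp literal, so the cast yields null
    lit("091940731349000").cast("timestamp").alias("bare_digits"),
    # Parses, but the trailing nanoseconds are truncated to microseconds
    lit("1970-01-01 09:19:40.731349000").cast("timestamp").alias("formatted")
).show(truncate=False)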
The closest you can get to the desired output is to convert the input into the JDBC-compliant java.sql.Timestamp format:
from pyspark.sql.functions import col, regexp_replace

df = spark.createDataFrame(
    ["091940731349000", "092955002327000", "092955004088000"],
    "string"
).toDF("time")

# Prepend a date and insert separators so the value becomes a parseable timestamp literal
df.select(regexp_replace(
    col("time"),
    "^(\\d{2})(\\d{2})(\\d{2})(\\d{9}).*",
    "1970-01-01 $1:$2:$3.$4"
).cast("timestamp").alias("time")).show(truncate=False)
# +--------------------------+
# |time |
# +--------------------------+
# |1970-01-01 09:19:40.731349|
# |1970-01-01 09:29:55.002327|
# |1970-01-01 09:29:55.004088|
# +--------------------------+
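If the full nanosecond value has to be preserved, one option (a sketch that is not part of the original answer; the column names ts_micros and nanos are only illustrative) is to keep the microsecond-precision timestamp next to the nanosecond fraction stored as a long:

from pyspark.sql.functions import col, regexp_extract, regexp_replace

df.select(
    # Microsecond-precision timestamp, exactly as above
    regexp_replace(
        col("time"),
        "^(\\d{2})(\\d{2})(\\d{2})(\\d{9}).*",
        "1970-01-01 $1:$2:$3.$4"
    ).cast("timestamp").alias("ts_micros"),
    # The full 9-digit fractional part kept as an integer number of nanoseconds
    regexp_extract(col("time"), "^\\d{6}(\\d{9})", 1).cast("long").alias("nanos")
).show(truncate=False)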
If you just want the string, skip the cast and limit the output to the reshaped value:
# Same reshaping, kept as a string so all nine fractional digits survive
df.select(regexp_replace(
    col("time"),
    "^(\\d{2})(\\d{2})(\\d{2})(\\d{9}).*",
    "$1:$2:$3.$4"
).alias("time")).show(truncate=False)
# +------------------+
# |time |
# +------------------+
# |09:19:40.731349000|
# |09:29:55.002327000|
# |09:29:55.004088000|
# +------------------+
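Equivalently, if you would rather avoid regular expressions, the same string can be assembled with substring and concat (a sketch using only built-in functions, assumed to behave the same as the regexp_replace version above):

from pyspark.sql.functions import concat, lit, substring

df.select(
    concat(
        substring("time", 1, 2), lit(":"),  # hours
        substring("time", 3, 2), lit(":"),  # minutes
        substring("time", 5, 2), lit("."),  # seconds
        substring("time", 7, 9)             # 9-digit nanosecond fraction
    ).alias("time")
).show(truncate=False)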