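How to convert a timestamp column to a milliseconds Long column in Spark SQL

Time: 2019-06-18 12:40:01

Tags: apache-spark apache-spark-sql

What is the shortest and most efficient way in Spark SQL to convert a Timestamp column into a Long column holding the timestamp in milliseconds?

Here is an example of converting from a timestamp to milliseconds: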

scala> val ts = spark.sql("SELECT now() as ts")
ts: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> ts.show(false)
+-----------------------+                                                       
|ts                     |
+-----------------------+
|2019-06-18 12:32:02.41 |
+-----------------------+

scala> val tss = ts.selectExpr(
 |   "ts",
 |   "BIGINT(ts) as seconds_ts",
 |   "BIGINT(ts) * 1000 + BIGINT(date_format(ts, 'S')) as millis_ts"
 | )
tss: org.apache.spark.sql.DataFrame = [ts: timestamp, seconds_ts: bigint ... 1 more field]

scala> tss.show(false)
+----------------------+----------+-------------+                               
|ts                    |seconds_ts|millis_ts    |
+----------------------+----------+-------------+
|2019-06-18 12:32:02.41|1560861122|1560861122410|
+----------------------+----------+-------------+

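As you can see, the most direct way to get milliseconds from a timestamp does not work: casting to long returns seconds, even though the timestamp itself retains the millisecond information.

The only way I have found to extract the millisecond information is the date_format function, which is not as simple as I would expect.

Does anyone know a simpler way to get the UNIX time in milliseconds from a Timestamp column?

1 answer:

Answer 0: (score: 0)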

According to the code in Spark's DateTimeUtils:

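"Timestamps are exposed externally as java.sql.Timestamp and are stored internally as longs, which are capable of storing timestamps with microsecond precision."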

So if you define a UDF that takes a java.sql.Timestamp as input, you can simply call getTime to get a Long in milliseconds.

import org.apache.spark.sql.functions.udf
val tsConversionToLongUdf = udf((ts: java.sql.Timestamp) => ts.getTime)
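One thing the UDF above leaves open is null input: as far as I know, Spark hands a null timestamp to a Scala UDF as a plain null when the argument is a reference type, so ts.getTime would throw. Below is a minimal sketch of a null-safe variant. The names NullSafeMillis and toEpochMillis are made up for illustration, and the sketch uses only java.sql.Timestamp so that it runs without Spark.

```scala
import java.sql.Timestamp

object NullSafeMillis {
  // Hypothetical helper (not part of Spark): returns None for a null
  // timestamp instead of throwing a NullPointerException.
  def toEpochMillis(ts: Timestamp): Option[Long] =
    Option(ts).map(_.getTime)

  def main(args: Array[String]): Unit = {
    val ts = Timestamp.valueOf("2017-01-18 11:00:00.111")
    // Print only the millisecond component, which does not depend on the JVM time zone.
    println(toEpochMillis(ts).map(_ % 1000)) // Some(111)
    println(toEpochMillis(null))             // None
  }
}
```

Assuming Spark's usual support for Option return types in Scala UDFs, wrapping the helper as udf(NullSafeMillis.toEpochMillis _) should give a nullable long column rather than failing on null rows.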

Applying tsConversionToLongUdf to a variety of timestamps:

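import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types.LongType
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in spark-shell

val df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.111", "2017-01-18 11:00:00.110", "2017-01-18 11:00:00.100")
  .toDF("timestampString")
  .withColumn("timestamp", to_timestamp(col("timestampString")))
  .withColumn("timestampConversionToLong", tsConversionToLongUdf(col("timestamp")))
  .withColumn("timestampCastAsLong", col("timestamp").cast(LongType))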

df.printSchema()
df.show(false)

// returns
root
 |-- timestampString: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampConversionToLong: long (nullable = false)
 |-- timestampCastAsLong: long (nullable = true)

+-----------------------+-----------------------+-------------------------+-------------------+
|timestampString        |timestamp              |timestampConversionToLong|timestampCastAsLong|
+-----------------------+-----------------------+-------------------------+-------------------+
|2017-01-18 11:00:00.000|2017-01-18 11:00:00    |1484733600000            |1484733600         |
|2017-01-18 11:00:00.111|2017-01-18 11:00:00.111|1484733600111            |1484733600         |
|2017-01-18 11:00:00.110|2017-01-18 11:00:00.11 |1484733600110            |1484733600         |
|2017-01-18 11:00:00.100|2017-01-18 11:00:00.1  |1484733600100            |1484733600         |
+-----------------------+-----------------------+-------------------------+-------------------+

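Note that the "timestampCastAsLong" column is only there to show that casting directly to Long does not return the desired result in milliseconds; it returns whole seconds.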
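To make the relationship between the two result columns concrete, here is a small sketch in plain Scala, without Spark. It splits getTime into whole seconds (analogous to what the cast to Long yields) and the millisecond remainder that the cast drops. The object and method names are hypothetical, and using floor division to mirror the cast is my assumption rather than something stated above.

```scala
import java.sql.Timestamp

object SecondsVsMillis {
  // Whole seconds since the epoch; analogous to timestampCastAsLong.
  def epochSeconds(ts: Timestamp): Long =
    java.lang.Math.floorDiv(ts.getTime, 1000L)

  // Millisecond-of-second component, the part the cast to Long discards.
  def millisOfSecond(ts: Timestamp): Long =
    java.lang.Math.floorMod(ts.getTime, 1000L)

  def main(args: Array[String]): Unit = {
    val ts = Timestamp.valueOf("2017-01-18 11:00:00.110")
    val recombined = epochSeconds(ts) * 1000L + millisOfSecond(ts)
    println(millisOfSecond(ts))       // 110
    println(recombined == ts.getTime) // true
  }
}
```

This is the same arithmetic as the seconds * 1000 + fraction expression in the question, just done on the JVM side. If you are on Spark 3.1 or later, the built-in SQL function unix_millis may make the UDF unnecessary; check the function reference for your version before relying on it.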