Spark timestamp difference

Time: 2018-04-17 12:47:49

Tags: java apache-spark timestamp

I'm trying to compute a timestamp difference in Spark, but it isn't working as expected.

Here is what I have tried:
import static org.apache.spark.sql.functions.*;

// Parse each column to seconds-of-day, subtract, and format the difference back to HH:mm:ss.
df = df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")), "HH:mm:ss"));

TimeStampHigh - 15:57:01
TimeStampLow - 00:11:57

It returns a result of 10:45:04; the expected output is 15:45:04 (the result is exactly 5 hours short).

My other option is to fall back to a UDF with a Java implementation, along the lines of the sketch below.
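
For reference, a rough, untested sketch of that UDF fallback (it assumes Java 8's java.time, a SparkSession named spark, and an illustrative UDF name timeDiff):

import static org.apache.spark.sql.functions.*;

import java.time.Duration;
import java.time.LocalTime;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

// Parse both columns as local times and format the elapsed duration as HH:mm:ss.
// Because this never round-trips through the unix epoch, it is time-zone independent.
spark.udf().register("timeDiff", (UDF2<String, String, String>) (high, low) -> {
    Duration d = Duration.between(LocalTime.parse(low), LocalTime.parse(high));
    return String.format("%02d:%02d:%02d", d.toHours(), d.toMinutes() % 60, d.getSeconds() % 60);
}, DataTypes.StringType);

df = df.withColumn("TimeStampDiff", callUDF("timeDiff", col("TimeStampHigh"), col("TimeStampLow")));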

Any pointers would be helpful.

1 Answer:

Answer 0: (score: 2)

That's because of from_unixtime (emphasis mine):

  Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.

Evidently your system, or the JVM, is not configured to use UTC.

You should do one of the following:

  • Configure the JVM to use a proper time zone (pass -Duser.timezone=UTC through spark.executor.extraJavaOptions and spark.driver.extraJavaOptions); see the sketch right after this list.
  • Set spark.sql.session.timeZone to a proper time zone.
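
A rough sketch of the first option (the app name is illustrative; note that the driver's own JVM option generally has to be supplied at launch time, e.g. spark-submit --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC, since the driver JVM is already running by the time the session is built):

import org.apache.spark.sql.SparkSession;

// Executor JVMs launched for this application will use UTC as their default time zone.
SparkSession spark = SparkSession.builder()
        .appName("TimestampDiff")  // illustrative name
        .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
        .getOrCreate();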

An example of the second option:

scala> val df = Seq(("15:57:01", "00:11:57")).toDF("TimeStampHigh", "TimeStampLow")
df: org.apache.spark.sql.DataFrame = [TimeStampHigh: string, TimeStampLow: string]

scala> spark.conf.set("spark.sql.session.timeZone", "GMT-5")  // Equivalent to your current settings

scala> df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),"HH:mm:ss")).show
+-------------+------------+-------------+
|TimeStampHigh|TimeStampLow|TimeStampDiff|
+-------------+------------+-------------+
|     15:57:01|    00:11:57|     10:45:04|
+-------------+------------+-------------+


scala> spark.conf.set("spark.sql.session.timeZone", "UTC")  // With UTC

scala> df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),"HH:mm:ss")).show
+-------------+------------+-------------+
|TimeStampHigh|TimeStampLow|TimeStampDiff|
+-------------+------------+-------------+
|     15:57:01|    00:11:57|     15:45:04|
+-------------+------------+-------------+
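
Since the question's code is Java, the same fix there would look roughly like this (assuming a SparkSession named spark, configured before running the query):

spark.conf().set("spark.sql.session.timeZone", "UTC");

// With the session time zone pinned to UTC, the original expression should now yield 15:45:04.
df = df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")), "HH:mm:ss"));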