I am trying to do a timestamp difference in Spark and it is not working as expected.

Here is how I am trying it (the code is Java, hence df.col and .minus):

import static org.apache.spark.sql.functions.*;

df = df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")), "HH:mm:ss"));
Values:

TimeStampHigh - 15:57:01
TimeStampLow - 00:11:57

It returns a result of 10:45:04.

Expected output - 15:45:04
My other option is to go with a UDF implemented in Java.

Any pointers would be helpful.
Answer (score: 2)
That's because from_unixtime (emphasis mine):

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
Clearly, your system or JVM is not configured to use UTC time.

You should do one of the following:

- Configure the JVM to use UTC by passing -Duser.timezone=UTC in both spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.
- Set spark.sql.session.timeZone to the appropriate time zone.

Example:
scala> val df = Seq(("15:57:01", "00:11:57")).toDF("TimeStampHigh", "TimeStampLow")
df: org.apache.spark.sql.DataFrame = [TimeStampHigh: string, TimeStampLow: string]
scala> spark.conf.set("spark.sql.session.timeZone", "GMT-5") // Equivalent to your current settings
scala> df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),"HH:mm:ss")).show
+-------------+------------+-------------+
|TimeStampHigh|TimeStampLow|TimeStampDiff|
+-------------+------------+-------------+
| 15:57:01| 00:11:57| 10:45:04|
+-------------+------------+-------------+
scala> spark.conf.set("spark.sql.session.timeZone", "UTC") // With UTC
scala> df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),"HH:mm:ss")).show
+-------------+------------+-------------+
|TimeStampHigh|TimeStampLow|TimeStampDiff|
+-------------+------------+-------------+
| 15:57:01| 00:11:57| 15:45:04|
+-------------+------------+-------------+
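To see why GMT-5 yields 10:45:04, the arithmetic can be reproduced with plain java.time and no Spark at all. This is a standalone sketch, not part of the answer above; the render helper is mine, introduced only for illustration:

```scala
import java.time.{Instant, ZoneId, ZoneOffset}
import java.time.format.DateTimeFormatter

// unix_timestamp parses each "HH:mm:ss" value as seconds since midnight,
// so the subtraction produces a plain number of seconds:
val high = 15 * 3600 + 57 * 60 + 1 // 15:57:01 -> 57421 s
val low  = 11 * 60 + 57            // 00:11:57 ->   717 s
val diffSeconds = high - low       // 56704 s, i.e. 15 h 45 m 4 s

// from_unixtime then treats that number as an epoch instant and renders it
// in the session time zone, which is where the shift comes from:
val fmt = DateTimeFormatter.ofPattern("HH:mm:ss")
def render(zone: ZoneId): String =
  fmt.format(Instant.ofEpochSecond(diffSeconds).atZone(zone))

render(ZoneOffset.UTC)     // "15:45:04" -- the expected duration
render(ZoneId.of("GMT-5")) // "10:45:04" -- the value the question observed
```

In other words, the subtraction itself is correct; only the final formatting step is time-zone sensitive, which is why fixing the session time zone fixes the output.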