Add 12 hours to a datetime column in Spark

Asked: 2016-11-30 07:59:29

Tags: apache-spark apache-spark-sql

I have searched a lot, but I could only find the add_months function in Spark SQL, so I'm opening a new thread here. Any help would be greatly appreciated.

I'm trying to add 12, 24, and 48 hours to a date column in Spark SQL using sqlContext. I'm on Spark 1.6.1, and I need something like this:

SELECT N1.subject_id, '12-HOUR' AS notes_period, N1.chartdate_start, N2.chartdate, N2.text
FROM NOTEEVENTS N2,
  (SELECT subject_id, MIN(chartdate) chartdate_start
   FROM NOTEEVENTS
   WHERE subject_id = 283
     AND category != 'Discharge summary'
   GROUP BY subject_id) N1
WHERE N2.subject_id = N1.subject_id
  AND n2.chartdate < n1.chartdate_start + interval '1 hour' * 12

Note that the last clause is written in PostgreSQL; that is what I need the Spark SQL equivalent of. I would really appreciate any help.

Thanks.

3 Answers:

Answer 0 (Score: 8)

There is no such built-in function yet, but you can write a UDF:

import java.sql.Timestamp

sqlContext.udf.register("add_hours", (datetime: Timestamp, hours: Int) => {
  // Shift the epoch milliseconds; the Long literal avoids Int overflow.
  new Timestamp(datetime.getTime() + hours * 60L * 60 * 1000)
})

For example:

SELECT N1.subject_id, '12-HOUR' AS notes_period, N1.chartdate_start, N2.chartdate, N2.text
FROM NOTEEVENTS N2,
  (SELECT subject_id, MIN(chartdate) chartdate_start
   FROM NOTEEVENTS
   WHERE subject_id = 283
     AND category != 'Discharge summary'
   GROUP BY subject_id) N1
WHERE N2.subject_id = N1.subject_id
  AND n2.chartdate < add_hours(n1.chartdate_start, 12)
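The registered UDF can also be called from the DataFrame API via callUDF (a minimal sketch; noteevents is a hypothetical DataFrame over the NOTEEVENTS table):

import org.apache.spark.sql.functions.{callUDF, col, lit}

// noteevents is assumed to be a DataFrame loaded from the NOTEEVENTS table.
val shifted = noteevents.withColumn(
  "chartdate_plus_12h",
  callUDF("add_hours", col("chartdate"), lit(12))
)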

You can also compute the new date with the unix_timestamp function. In my opinion it is less readable, but it can benefit from whole-stage code generation. Code inspired by Anton Okolnychyi's other answer:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}

// Convert to seconds since the epoch, add the offset in hours, convert back.
val addHours = (datetime: Column, hours: Column) =>
  from_unixtime(unix_timestamp(datetime) + hours * 60 * 60)
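A minimal usage sketch, assuming joined is a hypothetical DataFrame that already carries both chartdate and chartdate_start after the join from the question:

import org.apache.spark.sql.functions.{col, lit}

// Reproduce the question's condition without a UDF.
val within12h = joined.where(
  col("chartdate") < addHours(col("chartdate_start"), lit(12))
)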

Answer 1 (Score: 7)

How about converting the date to a timestamp in seconds with the unix_timestamp() function and then adding hours * 60 * 60 to it?

Your condition would then look like this:

unix_timestamp(n2.chartdate) < (unix_timestamp(n1.chartdate_start) + 12 * 60 * 60)
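For completeness, here is the question's query with that condition substituted in (a sketch only, assuming the same NOTEEVENTS table):

SELECT N1.subject_id, '12-HOUR' AS notes_period, N1.chartdate_start, N2.chartdate, N2.text
FROM NOTEEVENTS N2,
  (SELECT subject_id, MIN(chartdate) chartdate_start
   FROM NOTEEVENTS
   WHERE subject_id = 283
     AND category != 'Discharge summary'
   GROUP BY subject_id) N1
WHERE N2.subject_id = N1.subject_id
  AND unix_timestamp(n2.chartdate) < unix_timestamp(n1.chartdate_start) + 12 * 60 * 60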

Answer 2 (Score: 6)

Just like in PostgreSQL, you can use INTERVAL. In SQL:

spark.sql("""SELECT current_timestamp() AS now, 
                    current_timestamp() + INTERVAL 12 HOURS AS now_plus_twelve"""
).show(false)
+-----------------------+-----------------------+
|now                    |now_plus_twelve        |
+-----------------------+-----------------------+
|2017-12-14 10:49:15.115|2017-12-14 22:49:15.115|
+-----------------------+-----------------------+
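Applied to the question, the PostgreSQL-style clause maps almost one-to-one (a sketch, assuming the same NOTEEVENTS tables as above):

WHERE N2.subject_id = N1.subject_id
  AND n2.chartdate < n1.chartdate_start + INTERVAL 12 HOURS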

With the Dataset API, in Scala:

import org.apache.spark.sql.functions.{current_timestamp, expr}

spark.range(1)
  .select(
    current_timestamp as "now", 
    current_timestamp + expr("INTERVAL 12 HOURS") as "now_plus_twelve"
  ).show(false)
+-----------------------+-----------------------+
|now                    |now_plus_twelve        |
+-----------------------+-----------------------+
|2017-12-14 10:56:59.185|2017-12-14 22:56:59.185|
+-----------------------+-----------------------+

And in Python:

from pyspark.sql.functions import current_timestamp, expr

(spark.range(1).select(
    current_timestamp().alias("now"), 
    (current_timestamp() + expr("INTERVAL 12 HOURS")).alias("now_plus_twelve")))
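To print the same table as in the Scala example, append a show call (note that PySpark takes truncate as a keyword argument rather than a bare boolean):

(spark.range(1).select(
    current_timestamp().alias("now"),
    (current_timestamp() + expr("INTERVAL 12 HOURS")).alias("now_plus_twelve")
)).show(truncate=False)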