Spark / Hive: hours between two datetimes

Posted: 2016-05-11 19:19:34

Tags: hadoop apache-spark hive pyspark

I'd like to know how to get the exact number of hours between two datetimes.

There is a function called datediff that I could use to get the number of days and then convert to hours, but that is less precise than I'd like.

An example of what I'm hoping for, modeled after datediff:

>>> df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-19 19:15:00')], ['d1', 'd2'])
>>> df.select(hourdiff(df.d2, df.d1).alias('diff')).collect()
[Row(diff=22)]
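As a minimal sketch of the datediff workaround mentioned above (my illustration, assuming the same string columns): datediff counts whole calendar days, so the roughly 22-hour gap comes out as 24 hours after conversion.

>>> from pyspark.sql.functions import datediff
>>> # datediff() truncates to calendar days, so converting to hours overshoots here
>>> df.select((datediff(df.d2, df.d1) * 24).alias('approx_hours')).collect()
[Row(approx_hours=24)]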

2 answers:

Answer 0 (score: 0)

Try using a UDF. Below is some sample code; you can modify the UDF to return whatever granularity you need.

from pyspark.sql.functions import udf, col
from datetime import datetime
from pyspark.sql.types import LongType

def timediff_x():
    def _timediff_x(date1, date2):
        # Parse both timestamp strings and return the difference in whole days;
        # swap .days for another unit to change the granularity.
        date11 = datetime.strptime(date1, '%Y-%m-%d %H:%M:%S')
        date22 = datetime.strptime(date2, '%Y-%m-%d %H:%M:%S')
        return (date11 - date22).days
    return udf(_timediff_x, LongType())

df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-25 19:15:00')], ['d1', 'd2'])
df.select(timediff_x()(col("d2"), col("d1"))).show() 

+----------------------------+
|PythonUDF#_timediff_x(d2,d1)|
+----------------------------+
|                           6|
+----------------------------+
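For the hour granularity the question asks about, one way to adapt this UDF (a sketch of my own, not part of the original answer; the hourdiff name is borrowed from the question) is to round total_seconds() to whole hours:

from pyspark.sql.functions import udf, col
from datetime import datetime
from pyspark.sql.types import LongType

def hourdiff():
    def _hourdiff(date1, date2):
        # Parse both timestamp strings and round the elapsed time to whole hours.
        d1 = datetime.strptime(date1, '%Y-%m-%d %H:%M:%S')
        d2 = datetime.strptime(date2, '%Y-%m-%d %H:%M:%S')
        return int(round((d1 - d2).total_seconds() / 3600))
    return udf(_hourdiff, LongType())

df = sqlContext.createDataFrame([('2016-04-18 21:18:18', '2016-04-19 19:15:00')], ['d1', 'd2'])
df.select(hourdiff()(col('d2'), col('d1')).alias('diff')).collect()
# [Row(diff=22)]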

Answer 1 (score: 0)

If your columns are of type StringType(), you can use the answer to the following question:

Spark Scala: DateDiff of two columns by hour or minute

However, using the built-in functions can be an easier option than defining a UDF: unix_timestamp parses each string to epoch seconds, so subtracting the two columns gives the difference in seconds.

from pyspark.sql.functions import col, unix_timestamp

diffCol = unix_timestamp(col('d1'), 'yyyy-MM-dd HH:mm:ss') - \
          unix_timestamp(col('d2'), 'yyyy-MM-dd HH:mm:ss')
df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-19 19:15:00')], ['d1', 'd2'])
df2 = df.withColumn('diff_secs', diffCol)
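From there, getting hours is a plain division; note that with the sample data d1 precedes d2, so diff_secs is negative. A follow-up sketch of my own, not part of the original answer:

from pyspark.sql.functions import abs as abs_, col

# diff_secs is d1 - d2 in seconds (negative here, since d1 is earlier);
# take the absolute value and divide by 3600 for fractional hours (~21.95).
df2.withColumn('diff_hours', abs_(col('diff_secs')) / 3600).show()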