Difference between dates in PySpark SQL

Time: 2018-11-15 23:22:10

Tags: sql-server pyspark bigdata

I need to compute the difference between two dates. I know PySpark SQL has DATEDIFF, but it only works at day granularity. I wrote a function to compute the difference in minutes, but I get no output. Here is the code:

     ...
logRowsDF.createOrReplaceTempView("taxiTable")
#first way
spark.registerFunction("test", lambda x,y: ((dt.strptime(x, '%Y-%m-%d %H:%M:%S') - dt.strptime(y, '%Y-%m-%d %H:%M:%S')).days * 24 * 60) + ((dt.strptime(x, '%Y-%m-%d %H:%M:%S') - dt.strptime(y, '%Y-%m-%d %H:%M:%S')).seconds/60))
#second
spark.registerFunction("test", lambda x,y: countTime(x,y))
#third
diff = udf(countTime)
#trying to call that function that way
listIpsDF = spark.sql('SELECT diff(pickup,dropoff) AS TIME FROM taxiTable')

The function:

def countTime(time1, time2):
    fmt = '%Y-%m-%d %H:%M:%S'
    d1 = dt.strptime(time1, fmt)
    d2 = dt.strptime(time2, fmt)
    diff = d2 - d1
    diff_minutes = (diff.days * 24 * 60) + (diff.seconds/60)
    return str(diff_minutes)
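For what it's worth, the function itself behaves as expected when called directly in plain Python with the sample values below, so the problem appears to be in the Spark wiring rather than in the arithmetic (a quick standalone check):

```python
from datetime import datetime as dt

def countTime(time1, time2):
    fmt = '%Y-%m-%d %H:%M:%S'
    d1 = dt.strptime(time1, fmt)
    d2 = dt.strptime(time2, fmt)
    diff = d2 - d1
    # total minutes = whole days expressed in minutes + leftover seconds in minutes
    diff_minutes = (diff.days * 24 * 60) + (diff.seconds / 60)
    return str(diff_minutes)

print(countTime('2018-01-01 00:21:05', '2018-01-01 00:24:23'))  # 3.3
print(countTime('2018-01-01 00:44:55', '2018-01-01 01:03:05'))  # 18.166666666666668
```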

It doesn't work at all. Can you help me?

An example:

+-------------------+-------------------+
|             pickup|            dropoff|
+-------------------+-------------------+
|2018-01-01 00:21:05|2018-01-01 00:24:23|
|2018-01-01 00:44:55|2018-01-01 01:03:05|
|                  ...                  |
+-------------------+-------------------+

Expected output (in minutes):

+-------------------+
|    datediff       |
+-------------------+
|        3.3        |
| 18.166666666666668|
|        ...        |
+-------------------+

1 answer:

Answer 0 (score: 1):

Actually, I'm not sure where your error is, because some of the sample code doesn't add up (for example, you register a function named "test" but the SQL statement uses an unregistered function called diff, which should produce an error message). In any case, here is a working version of your code:

from pyspark.sql.functions import udf
from datetime import datetime as dt

l = [('2018-01-01 00:21:05','2018-01-01 00:24:23')
,('2018-01-01 00:44:55', '2018-01-01 01:03:05')
]

df = spark.createDataFrame(l,['begin','end'])
df.registerTempTable('test')

def countTime(time1, time2):
    fmt = '%Y-%m-%d %H:%M:%S'
    d1 = dt.strptime(time1, fmt)
    d2 = dt.strptime(time2, fmt)
    diff = d2 - d1
    diff_minutes = (diff.days * 24 * 60) + (diff.seconds/60)
    return str(diff_minutes)

diff = udf(countTime)
sqlContext.registerFunction("diffSQL", lambda x, y: countTime(x,y))

print('column expression udf works')
df.withColumn('bla', diff(df.begin,df.end)).show()
print('sql udf works')
spark.sql('select diffSQL(begin,end) from test').show()

Output of the example:

column expression udf works
+-------------------+-------------------+------------------+ 
|              begin|                end|               bla| 
+-------------------+-------------------+------------------+ 
|2018-01-01 00:21:05|2018-01-01 00:24:23|               3.3| 
|2018-01-01 00:44:55|2018-01-01 01:03:05|18.166666666666668| 
+-------------------+-------------------+------------------+ 
sql udf works 
+-------------------+ 
|diffSQL(begin, end)| 
+-------------------+ 
|                3.3| 
| 18.166666666666668|
+-------------------+
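As an aside: on recent Spark versions you shouldn't need a Python UDF for this at all, since the built-in `unix_timestamp` SQL function (whose default parse format is `yyyy-MM-dd HH:mm:ss`) lets you express the difference directly, e.g. `SELECT (unix_timestamp(end) - unix_timestamp(begin)) / 60 FROM test`. The arithmetic is the same epoch-second subtraction, sketched here in plain Python:

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

def diff_minutes(begin, end):
    # Same idea as (unix_timestamp(end) - unix_timestamp(begin)) / 60 in Spark SQL:
    # subtract the two parsed timestamps and convert the elapsed seconds to minutes.
    delta = datetime.strptime(end, FMT) - datetime.strptime(begin, FMT)
    return delta.total_seconds() / 60

print(diff_minutes('2018-01-01 00:21:05', '2018-01-01 00:24:23'))  # 3.3
print(diff_minutes('2018-01-01 00:44:55', '2018-01-01 01:03:05'))  # 18.166666666666668
```

Keeping the computation in built-in SQL functions also avoids the serialization overhead of a Python UDF.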