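So I need to compute the difference between two dates. I know PySpark SQL does support DATEDIFF, but only at day granularity. I wrote a function to compute the difference, but I get no output. The code is below: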
...
logRowsDF.createOrReplaceTempView("taxiTable")
#first way
spark.registerFunction("test", lambda x,y: ((dt.strptime(x, '%Y-%m-%d %H:%M:%S') - dt.strptime(y, '%Y-%m-%d %H:%M:%S')).days * 24 * 60) + ((dt.strptime(x, '%Y-%m-%d %H:%M:%S') - dt.strptime(y, '%Y-%m-%d %H:%M:%S')).seconds/60))
#second
spark.registerFunction("test", lambda x,y: countTime(x,y))
#third
diff = udf(countTime)
#trying to call that function that way
listIpsDF = spark.sql('SELECT diff(pickup,dropoff) AS TIME FROM taxiTable')
The function:
def countTime(time1, time2):
    fmt = '%Y-%m-%d %H:%M:%S'
    d1 = dt.strptime(time1, fmt)
    d2 = dt.strptime(time2, fmt)
    diff = d2 - d1
    diff_minutes = (diff.days * 24 * 60) + (diff.seconds/60)
    return str(diff_minutes)
It doesn't work at all. Can you help me?
An example:
+-------------------+-------------------+
| pickup| dropoff|
+-------------------+-------------------+
|2018-01-01 00:21:05|2018-01-01 00:24:23|
|2018-01-01 00:44:55|2018-01-01 01:03:05|
| ... |
+-------------------+-------------------+
Expected output (in minutes):
+-------------------+
| datediff |
+-------------------+
| 3.3 |
| 18.166666666666668|
| ... |
+-------------------+
Answer 0 (score: 1)
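Actually, I'm not sure where your error is, because some of the sample code doesn't make sense. For example, you register a function named "test", but the SQL statement calls a function diff that was never registered with Spark SQL, which should produce an error message. In any case, please find below a working example of your code: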
from pyspark.sql.functions import udf
from datetime import datetime as dt
l = [('2018-01-01 00:21:05','2018-01-01 00:24:23')
,('2018-01-01 00:44:55', '2018-01-01 01:03:05')
]
df = spark.createDataFrame(l,['begin','end'])
df.registerTempTable('test')
def countTime(time1, time2):
    fmt = '%Y-%m-%d %H:%M:%S'
    d1 = dt.strptime(time1, fmt)
    d2 = dt.strptime(time2, fmt)
    diff = d2 - d1
    diff_minutes = (diff.days * 24 * 60) + (diff.seconds/60)
    return str(diff_minutes)
diff = udf(countTime)
sqlContext.registerFunction("diffSQL", lambda x, y: countTime(x,y))
print('column expression udf works')
df.withColumn('bla', diff(df.begin,df.end)).show()
print('sql udf works')
spark.sql('select diffSQL(begin,end) from test').show()
Output of the example:
column expression udf works
+-------------------+-------------------+------------------+
| begin| end| bla|
+-------------------+-------------------+------------------+
|2018-01-01 00:21:05|2018-01-01 00:24:23| 3.3|
|2018-01-01 00:44:55|2018-01-01 01:03:05|18.166666666666668|
+-------------------+-------------------+------------------+
sql udf works
+-------------------+
|diffSQL(begin, end)|
+-------------------+
| 3.3|
| 18.166666666666668|
+-------------------+
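As a side note, the minute calculation itself can be checked outside Spark. The following is a minimal sketch, not part of the original answer: the name minutes_between and the constant FMT are made up for illustration, and it uses timedelta.total_seconds() in place of the days/seconds arithmetic. It also returns a float instead of a string, which is an assumption about what you want downstream.

```python
from datetime import datetime

# Timestamp layout used in the question's sample data.
FMT = '%Y-%m-%d %H:%M:%S'

def minutes_between(start, end, fmt=FMT):
    """Return the minutes elapsed from start to end as a float.

    Hypothetical helper for illustration; the result is negative
    when end is earlier than start.
    """
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # total_seconds() already accounts for whole days, so no manual
    # days * 24 * 60 term is needed.
    return delta.total_seconds() / 60

# The two sample rows from the question.
rows = [
    ('2018-01-01 00:21:05', '2018-01-01 00:24:23'),
    ('2018-01-01 00:44:55', '2018-01-01 01:03:05'),
]

for begin, end in rows:
    print(begin, end, minutes_between(begin, end))
```

If a helper like this were wrapped with udf, a numeric return type such as DoubleType() would presumably have to be passed as the second argument, since udf defaults to a string result.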