Duration between the current row and the next row using PySpark

Time: 2018-03-26 15:02:35

Tags: pyspark

The trip data:

id,timestamp
1008,2003-11-03 15:00:31
1008,2003-11-03 15:02:38
1008,2003-11-03 15:03:04
1008,2003-11-03 15:18:00
1009,2003-11-03 22:00:00
1009,2003-11-03 22:02:53
1009,2003-11-03 22:03:44 
1009,2003-11-14 10:00:00
1009,2003-11-14 10:02:02
1009,2003-11-14 10:03:10

Using Pandas:

import numpy as np
import pandas as pd

# Gap to the next row, only when the next row belongs to the same id
# (assumes timestamp is stored as epoch milliseconds, so /1000 gives seconds)
trip['time_diff'] = np.where(trip['id'] == trip['id'].shift(-1),
                             (trip['timestamp'].shift(-1) - trip['timestamp']) / 1000,
                             None)

trip['time_diff'] = pd.to_numeric(trip['time_diff'])
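For reference, with the datetime-like timestamps shown in the sample data (rather than epoch milliseconds), the same per-id gap could also be written with a groupby/shift. This is only a sketch of an alternative pandas formulation, not the asker's original setup:

import pandas as pd

# Sketch: gap in seconds to the next row within each id,
# assuming trip['timestamp'] holds datetimes as in the sample data above
trip = pd.DataFrame({
    "id": [1008, 1008, 1009, 1009],
    "timestamp": pd.to_datetime([
        "2003-11-03 15:00:31", "2003-11-03 15:02:38",
        "2003-11-03 22:00:00", "2003-11-03 22:02:53",
    ]),
})

# shift(-1) within each id gives the next timestamp; the last row per id gets NaN
trip["time_diff"] = (
    trip.groupby("id")["timestamp"].shift(-1) - trip["timestamp"]
).dt.total_seconds()
print(trip)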

I tried to do the same thing in PySpark, but it doesn't work. I have only been programming with Spark for a week, and I still can't get window functions to work.

from pyspark.sql.types import *
from pyspark.sql import Window
from pyspark.sql import functions as F

my_window = Window.partitionBy('id').orderBy('timestamp').rowsBetween(0, 1)

timeFmt = "yyyy-MM-dd HH:mm:ss"

time_diff = (F.unix_timestamp(trip.timestamp, format=timeFmt).cast("long")  - 
             F.unix_timestamp(trip.timestamp, format=timeFmt).over(my_window).cast("long")) 

trip = trip.withColumn('time_diff', time_diff)

I want to know whether this is the right way to do it!! If not, how can I convert this operation to PySpark?


The result should be:

id, timestamp, diff_time
1008, 2003-11-03 15:00:31, 127
1008, 2003-11-03 15:02:38, 26
1008, 2003-11-03 15:03:04, 896
1008, 2003-11-03 15:18:00, None
1009, 2003-11-03 22:00:00, 173
1009, 2003-11-03 22:02:53, 51
1009, 2003-11-03 22:03:44, 956776
1009, 2003-11-14 10:00:00, .....
1009, 2003-11-14 10:02:02, .....
1009, 2003-11-14 10:03:10, .....

1 Answer:

Answer 0 (score: 2)

You can use the lead function and then compute the time difference. Here is what you want:

// Assumes the trip DataFrame has already been registered as a temp view named "data"
val interdf = spark.sql("select id, timestamp, lead(timestamp) over (partition by id order by timestamp) as next_ts from data")
interdf.createOrReplaceTempView("interdf")
spark.sql("select id, timestamp, next_ts, unix_timestamp(next_ts) - unix_timestamp(timestamp) from interdf").show()
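Since the question is tagged pyspark, the same SQL approach could be run from PySpark roughly as follows. This is a minimal sketch: it assumes the trip DataFrame from the question and an existing SparkSession named spark, and it reuses the view name "data" from the query above.

# Sketch: the same lead()-based SQL, driven from PySpark
trip.createOrReplaceTempView("data")

interdf = spark.sql("""
    select id, timestamp,
           lead(timestamp) over (partition by id order by timestamp) as next_ts
    from data
""")
interdf.createOrReplaceTempView("interdf")

spark.sql("""
    select id, timestamp, next_ts,
           unix_timestamp(next_ts) - unix_timestamp(timestamp) as diff_time
    from interdf
""").show()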

If you want to avoid using spark-sql, you can do the same thing by importing the relevant functions:
import org.apache.spark.sql.functions.lead
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("id").orderBy("timestamp")

The corresponding Python code:

from pyspark.sql import Window
from pyspark.sql.functions import abs, col, lead

window = Window.partitionBy("id").orderBy("timestamp")

# lead() gives the next timestamp within the same id; casting a timestamp to long yields epoch seconds
diff = col("timestamp").cast("long") - lead("timestamp", 1).over(window).cast("long")
df = df.withColumn("diff", diff)
# current minus next is negative, so take the absolute value
df = df.withColumn("diff", abs(col("diff")))

Result:

(screenshot of the resulting output)
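Putting the window approach together on the sample data from the question, a self-contained PySpark sketch might look like the one below. The SparkSession setup and the to_timestamp parsing are assumptions for illustration; consistent with the expected output, the last row of each id ends up with a null diff_time.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, lead, to_timestamp

spark = SparkSession.builder.getOrCreate()

# Sample rows from the question
rows = [
    (1008, "2003-11-03 15:00:31"), (1008, "2003-11-03 15:02:38"),
    (1008, "2003-11-03 15:03:04"), (1008, "2003-11-03 15:18:00"),
    (1009, "2003-11-03 22:00:00"), (1009, "2003-11-03 22:02:53"),
    (1009, "2003-11-03 22:03:44"), (1009, "2003-11-14 10:00:00"),
    (1009, "2003-11-14 10:02:02"), (1009, "2003-11-14 10:03:10"),
]
trip = spark.createDataFrame(rows, ["id", "timestamp"]) \
            .withColumn("timestamp", to_timestamp("timestamp"))

# Next timestamp within the same id, in timestamp order
w = Window.partitionBy("id").orderBy("timestamp")
trip = trip.withColumn(
    "diff_time",
    lead("timestamp", 1).over(w).cast("long") - col("timestamp").cast("long"),
)

# The last row of each id has no next row, so diff_time is null there
trip.orderBy("id", "timestamp").show(truncate=False)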