The trip data:
id,timestamp
1008,2003-11-03 15:00:31
1008,2003-11-03 15:02:38
1008,2003-11-03 15:03:04
1008,2003-11-03 15:18:00
1009,2003-11-03 22:00:00
1009,2003-11-03 22:02:53
1009,2003-11-03 22:03:44
1009,2003-11-14 10:00:00
1009,2003-11-14 10:02:02
1009,2003-11-14 10:03:10
Using Pandas:
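import numpy as np
import pandas as pd
# Gap to the next row when it has the same id, else None.
# Dividing by 1000 assumes `timestamp` holds epoch milliseconds.
trip['time_diff'] = np.where(trip['id'] == trip['id'].shift(-1),
(trip['timestamp'].shift(-1) - trip['timestamp']) / 1000,
None)
trip['time_diff'] = pd.to_numeric(trip['time_diff'])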
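For comparison, here is a minimal Pandas sketch of the same per-id gap. It assumes the timestamp column has been parsed into datetime values (not epoch milliseconds) and uses only a subset of the sample rows. Grouping by id and shifting inside each group is one possible way to get the "next row of the same id" without comparing ids by hand:

```python
import pandas as pd

# A subset of the sample rows from the question.
trip = pd.DataFrame({
    "id": [1008, 1008, 1008, 1008, 1009, 1009, 1009],
    "timestamp": pd.to_datetime([
        "2003-11-03 15:00:31", "2003-11-03 15:02:38",
        "2003-11-03 15:03:04", "2003-11-03 15:18:00",
        "2003-11-03 22:00:00", "2003-11-03 22:02:53",
        "2003-11-03 22:03:44",
    ]),
})

# Sort so that "next row" means "next timestamp for the same id".
trip = trip.sort_values(["id", "timestamp"]).reset_index(drop=True)

# shift(-1) inside each group gives the next timestamp,
# or NaT on the last row of an id.
next_ts = trip.groupby("id")["timestamp"].shift(-1)

# Convert the timedelta to seconds; the last row of each id stays NaN.
trip["time_diff"] = (next_ts - trip["timestamp"]).dt.total_seconds()
print(trip)
```

The last row of each id has no successor, so its time_diff is NaN rather than a number.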
I tried this in PySpark, but it did not work. I have been programming with Spark for a week and I still cannot get window functions to work.
from pyspark.sql.types import *
from pyspark.sql import window
from pyspark.sql import functions as F
my_window = Window.partition('id').orderBy('timestamp').rowsBetween(0, 1)
timeFmt = "yyyy-MM-dd HH:mm:ss"
time_diff = (F.unix_timestamp(trip.timestamp, format=timeFmt).cast("long") -
F.unix_timestamp(trip.timestamp, format=timeFmt).over(my_window).cast("long"))
trip = trip.withColumn('time_diff', time_diff)
Is this the right way to do it? If not, how can I convert this operation to PySpark?
The result should be:
id, timestamp, diff_time
1008, 2003-11-03 15:00:31, 127
1008, 2003-11-03 15:02:38, 26
1008, 2003-11-03 15:03:04, 896
1008, 2003-11-03 15:18:00, None
1009, 2003-11-03 22:00:00, 173
1009, 2003-11-03 22:02:53, 51
1009, 2003-11-03 22:03:44, 906976
1009, 2003-11-14 10:00:00, .....
1009, 2003-11-14 10:02:02, .....
1009, 2003-11-14 10:03:10, .....
Answer 0 (score: 2)
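You can use the lead function to fetch the next timestamp within each id and then compute the time difference. The following does what you want: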
val interdf = spark.sql("select id, timestamp, lead(timestamp) over (partition by id order by timestamp) as next_ts from data")
interdf.createOrReplaceTempView("interdf")
spark.sql("select id, timestamp, next_ts, unix_timestamp(next_ts) - unix_timestamp(timestamp) from interdf").show()
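To make the window semantics concrete, here is a plain-Python sketch (no Spark involved) of what lead(timestamp) over (partition by id order by timestamp) followed by the subtraction produces. The helper name seconds_to_next is hypothetical, and the rows are a subset of the question's sample data:

```python
from datetime import datetime
from itertools import groupby

FMT = "%Y-%m-%d %H:%M:%S"

# A subset of the question's sample rows: (id, timestamp).
rows = [
    (1008, "2003-11-03 15:00:31"),
    (1008, "2003-11-03 15:02:38"),
    (1008, "2003-11-03 15:03:04"),
    (1008, "2003-11-03 15:18:00"),
    (1009, "2003-11-03 22:00:00"),
    (1009, "2003-11-03 22:02:53"),
    (1009, "2003-11-03 22:03:44"),
]


def seconds_to_next(rows):
    """Return (id, timestamp, diff) tuples, where diff is the gap in seconds
    to the next timestamp of the same id, or None on the last row of an id."""
    # Sorting by (id, timestamp) stands in for ORDER BY timestamp per id.
    parsed = sorted((rid, datetime.strptime(ts, FMT)) for rid, ts in rows)
    result = []
    # groupby plays the role of PARTITION BY id.
    for rid, group in groupby(parsed, key=lambda r: r[0]):
        stamps = [ts for _, ts in group]
        # Pairing each timestamp with its successor mimics lead(timestamp, 1).
        for current, nxt in zip(stamps, stamps[1:] + [None]):
            diff = int((nxt - current).total_seconds()) if nxt is not None else None
            result.append((rid, current.strftime(FMT), diff))
    return result


for row in seconds_to_next(rows):
    print(row)
```

The last row of each partition has no successor, which is why lead returns null there and the difference is null as well.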
If you want to avoid Spark SQL, you can do the same thing by importing the relevant functions:
import org.apache.spark.sql.functions.lead
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("id").orderBy("timestamp")
The equivalent Python code:
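from pyspark.sql import Window
from pyspark.sql.functions import abs, col, lead
# Assumes `timestamp` is a TimestampType column; a string column
# would need .cast("timestamp") first, otherwise the long cast yields null.
window = Window.partitionBy("id").orderBy("timestamp")
# Current row minus the next row of the same id (null on the last row).
diff = col("timestamp").cast("long") - lead("timestamp", 1).over(window).cast("long")
df = df.withColumn("diff", diff)
# The subtraction above is negative, so take the absolute value in seconds.
df = df.withColumn('diff', abs(df.diff))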