My dataset contains speeds recorded for multiple vehicles as a function of time. Each vehicle has a unique ID.
The data looks like this:
+-----------------+-----------+------+
| timestamp| ID| speed|
+-----------------+-----------+------+
|1.485320164625E12|-2140210972|139.25|
| 1.48532016475E12|-2140210972| 139.5|
|1.485320164875E12|-2140210972| 140.0|
| 1.485320165E12|-2140210972| 141.5|
|1.485320165125E12|-2140210972| 142.0|
| 1.48532016525E12|-2140210972|141.75|
|1.485320165375E12|-2140210972|141.25|
| 1.4853201655E12|-2140210972| 142.5|
|1.485320165625E12|-2140210972|142.75|
| 1.48532016575E12|-2140210972| 143.0|
|1.485320165875E12|-2140210972|143.75|
| 1.485320166E12|-2140210972| 144.5|
|1.485320166125E12|-2140210972| 144.0|
| 1.48532016625E12|-2140210972|144.75|
|1.485320166375E12|-2140210972| 144.5|
| 1.4853201665E12|-2140210972| 145.5|
|1.485320166625E12|-2140210972|145.75|
| 1.48532016675E12|-2140210972|144.25|
|1.485320166875E12|-2140210972|145.25|
| 1.485320167E12|-2140210972| 144.5|
+-----------------+-----------+------+
only showing top 20 rows
I want to find the maximum speed per vehicle and the first timestamp at which that maximum occurs.
I tried the following:
from pyspark.sql import functions as F
df.groupBy("ID").agg(F.first(F.max("speed"))).show()
but I got this error:
'It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query'
I then thought about something like this:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank rows within each (ID, speed) group by time, so rank == 1 marks the
# first timestamp at which a given speed occurs; also compute the per-ID max.
win = Window.partitionBy("ID", "speed").orderBy("timestamp")
df2 = (df.withColumn("rank", F.rank().over(win))
         .withColumn("max_speed", F.max("speed").over(Window.partitionBy("ID"))))
result = df2.filter((df2.speed == df2.max_speed) & (df2.rank == 1))
But this seems overly complicated for such a simple operation, doesn't it?
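To make the intended semantics concrete, here is a small plain-Python sketch (not Spark) of the result I am after, using a hypothetical helper `max_speed_first_ts` over a few of the rows shown above:

```python
# A handful of (timestamp, ID, speed) rows taken from the sample above.
rows = [
    (1.485320164625e12, -2140210972, 139.25),
    (1.485320166500e12, -2140210972, 145.50),
    (1.485320166625e12, -2140210972, 145.75),
    (1.485320166750e12, -2140210972, 144.25),
]

def max_speed_first_ts(rows):
    """Per ID: the maximum speed and the earliest timestamp where it occurs."""
    best = {}  # ID -> (max speed so far, timestamp of its first occurrence)
    for ts, vid, speed in sorted(rows):  # sorted by timestamp, so ties keep the first
        if vid not in best or speed > best[vid][0]:
            best[vid] = (speed, ts)
    return best

print(max_speed_first_ts(rows))
```

The strict `>` comparison is what keeps the *first* timestamp when the maximum speed appears more than once.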