Spark DataFrame: how to keep only the latest record for each group, based on ID and date?

Time: 2020-01-23 19:54:40

Tags: dataframe date apache-spark pyspark

I have a DataFrame:

DF:

1,2016-10-12 18:24:25
1,2016-11-18 14:47:05
2,2016-10-12 21:24:25
2,2016-10-12 20:24:25
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25

How can I keep only the latest record for each group? (There are 3 groups above: 1, 2 and 3.)

The result should be:

1,2016-11-18 14:47:05
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25

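I am also trying to be efficient (for example, finishing within a few minutes on a medium-sized cluster with 100 million records), so any sorting or ordering, if needed, should be done in the most efficient and correct way.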

2 answers:

Answer 0: (score: 2)

You have to use a window function.

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window

Partition the window by group and order it by time. The PySpark script below works:

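# Import only what is used instead of a wildcard import
from pyspark.sql.functions import dense_rank, desc
from pyspark.sql.window import Window

# `spark` is the SparkSession that notebooks and the pyspark shell predefine
schema = "Group int, time timestamp"

df = spark.read.format('csv').schema(schema).options(header=False).load('/FileStore/tables/Group_window.txt')

# Rank the rows of each group from newest to oldest
w = Window.partitionBy('Group').orderBy(desc('time'))
df = df.withColumn('Rank', dense_rank().over(w))

# Keep the newest row(s) of each group, then drop the helper column
df.filter(df.Rank == 1).drop(df.Rank).show()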


+-----+-------------------+
|Group|               time|
+-----+-------------------+
|    1|2016-11-18 14:47:05|
|    3|2016-10-12 17:24:25|
|    2|2016-10-12 22:24:25|
+-----+-------------------+





Answer 1: (score: -1)

For a case like this, you can use a window function, as described here:

scala> val in = Seq((1,"2016-10-12 18:24:25"),
     | (1,"2016-11-18 14:47:05"),
     | (2,"2016-10-12 21:24:25"),
     | (2,"2016-10-12 20:24:25"),
     | (2,"2016-10-12 22:24:25"),
     | (3,"2016-10-12 17:24:25")).toDF("id", "ts")
in: org.apache.spark.sql.DataFrame = [id: int, ts: string]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val win = Window.partitionBy("id").orderBy('ts desc)
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@59fa04f7
scala> in.withColumn("rank", row_number().over(win)).where('rank === 1).show(false)
+---+-------------------+----+
| id|                 ts|rank|
+---+-------------------+----+
|  1|2016-11-18 14:47:05|   1|
|  3|2016-10-12 17:24:25|   1|
|  2|2016-10-12 22:24:25|   1|
+---+-------------------+----+