I have a dataframe:
DF:
1,2016-10-12 18:24:25
1,2016-11-18 14:47:05
2,2016-10-12 21:24:25
2,2016-10-12 20:24:25
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
How do I keep only the latest record for each group? (There are 3 groups above: 1, 2, 3.)
The result should be:
1,2016-11-18 14:47:05
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
I am also trying to make this efficient (e.g., finishing within a few minutes on a modest cluster with 100 million records), so any sorting/ordering should (if it is needed at all) be done in the most efficient and correct way.
Answer 0 (score: 2)
You have to use a window function.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window
Partition the window by the group column and order it by the timestamp; the PySpark script below does the job.
from pyspark.sql.functions import *
from pyspark.sql.window import Window

# Read the CSV file with an explicit schema (group id + event timestamp)
schema = "Group int, time timestamp"
df = spark.read.format('csv').schema(schema).options(header=False).load('/FileStore/tables/Group_window.txt')

# Window per group, ordered by time descending, so the latest record gets rank 1
w = Window.partitionBy('Group').orderBy(desc('time'))
df = df.withColumn('Rank', dense_rank().over(w))

# Keep only the top-ranked (latest) record in each group and drop the helper column
df.filter(df.Rank == 1).drop(df.Rank).show()
+-----+-------------------+
|Group| time|
+-----+-------------------+
| 1|2016-11-18 14:47:05|
| 3|2016-10-12 17:24:25|
| 2|2016-10-12 22:24:25|
+-----+-------------------+
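One caveat: dense_rank assigns the same rank to ties, so if two rows in a group share the exact same maximum timestamp, both are kept. If exactly one row per group is required even in that case, row_number can be used instead; a minimal sketch reusing the df and w defined above:

# row_number is unique within each window partition, so tied timestamps do not survive the filter
df.withColumn('rn', row_number().over(w)).filter(col('rn') == 1).drop('rn').show()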
Answer 1 (score: -1)
For a case like this you can use window functions, as described here:
scala> val in = Seq((1,"2016-10-12 18:24:25"),
| (1,"2016-11-18 14:47:05"),
| (2,"2016-10-12 21:24:25"),
| (2,"2016-10-12 20:24:25"),
| (2,"2016-10-12 22:24:25"),
| (3,"2016-10-12 17:24:25")).toDF("id", "ts")
in: org.apache.spark.sql.DataFrame = [id: int, ts: string]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val win = Window.partitionBy("id").orderBy('ts desc)
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@59fa04f7
scala> in.withColumn("rank", row_number().over(win)).where('rank === 1).show(false)
+---+-------------------+----+
| id| ts|rank|
+---+-------------------+----+
| 1|2016-11-18 14:47:05| 1|
| 3|2016-10-12 17:24:25| 1|
| 2|2016-10-12 22:24:25| 1|
+---+-------------------+----+
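Both answers rank every row inside a window, which requires sorting each group's rows. For the efficiency goal in the question (roughly 100 million records on a modest cluster), note that only the group key and its latest timestamp are needed here, so a plain aggregation may be cheaper, since Spark can combine partial maxima on the map side instead of sorting per partition. A minimal PySpark sketch, assuming the same df and column names as in Answer 0:

from pyspark.sql import functions as F

# Latest timestamp per group via aggregation; no window or per-group sort needed
latest = df.groupBy('Group').agg(F.max('time').alias('time'))
latest.show()

This is only a drop-in replacement because no other columns need to be carried along; if additional columns were required, the window-function approach (or a join back onto the aggregated result) would be the way to go.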