Selecting the latest record for each key from a Spark DataFrame

Time: 2019-04-10 14:57:57

Tags: apache-spark-sql

I have a DataFrame that looks like this:

+-------+---------+
|email  |timestamp|
+-------+---------+
|x@y.com|        1|
|y@m.net|        2|
|z@c.org|        3|
|x@y.com|        4|
|y@m.net|        5|
|    .. |       ..|
+-------+---------+

For each email I want to keep only the latest record (the one with the highest timestamp), so the result would be:

+-------+---------+
|email  |timestamp|
+-------+---------+
|x@y.com|        4|
|y@m.net|        5|
|z@c.org|        3|
|    .. |       ..|
+-------+---------+

How can I do this? I am new to Spark and DataFrames.

1 answer:

Answer 0 (score: 1)

Here is a regular ANSI SQL query that should work with Spark SQL:

SELECT email, timestamp
FROM
(
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY email ORDER BY timestamp DESC) rn
    FROM yourTable t
) t
WHERE rn = 1;
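
To check what the query returns without a Spark cluster, here is a minimal sketch that runs the same statement against the sample rows from the question. It assumes SQLite (through Python's built-in sqlite3 module, SQLite 3.25 or newer for window functions) as a stand-in for Spark SQL. The helper name latest_per_email and the added ORDER BY email are only there for illustration and are not part of the original query.

```python
import sqlite3

# Sample rows mirroring the table in the question: (email, timestamp).
rows = [
    ("x@y.com", 1),
    ("y@m.net", 2),
    ("z@c.org", 3),
    ("x@y.com", 4),
    ("y@m.net", 5),
]

# Same query as above; ORDER BY email is added only to make the output stable.
QUERY = """
SELECT email, timestamp
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY timestamp DESC) AS rn
    FROM yourTable t
) t
WHERE rn = 1
ORDER BY email
"""


def latest_per_email(records):
    """Load records into an in-memory SQLite table and keep the newest row per email.

    Hypothetical helper: SQLite stands in for Spark SQL here (assumption).
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE yourTable (email TEXT, timestamp INTEGER)")
        conn.executemany("INSERT INTO yourTable VALUES (?, ?)", records)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = latest_per_email(rows)
print(result)  # [('x@y.com', 4), ('y@m.net', 5), ('z@c.org', 3)]
```

Note that if two rows for the same email share the same timestamp, ROW_NUMBER() still keeps only one of them, and which one is not defined unless a tie-breaker column is added to the ORDER BY.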

For the PySpark DataFrame equivalent, try the following:

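from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows within each email, newest timestamp first
w = Window.partitionBy("email").orderBy(F.col("timestamp").desc())
df = yourDF.withColumn("rn", F.row_number().over(w))

# Keep only the newest row per email, then drop the helper column
df = df.filter(F.col("rn") == 1).drop("rn")
df.show()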