Removing duplicate rows from a DataFrame in Java

Date: 2018-02-05 15:28:01

Tags: java scala apache-spark dataframe apache-spark-sql

I have a DataFrame with the following contents:

|id|Name|Country|version|
|1 |Jack|UK     |new    |
|1 |Jack|USA    |old    |
|2 |Rose|Germany|new    |
|3 |Sam |France |old    |

I want to create a DataFrame where, whenever rows are duplicated based on id, the new version is chosen over the old version, like so:

|id|Name|Country|version|
|1 |Jack|UK     |new    |
|2 |Rose|Germany|new    |
|3 |Sam |France |old    |

What is the best way to do this in Java/Spark, or do I have to use some kind of nested SQL query?

A simplified SQL version would look like this:

WITH new_version AS (
    SELECT
      ad.id
      ,ad.name
      ,ad.country
      ,ad.version
    FROM allData ad
    WHERE ad.version = 'new'
),
old_version AS (
    SELECT
      ad.id
      ,ad.name
      ,ad.country
      ,ad.version
    FROM allData ad
    LEFT JOIN new_version nv ON nv.id = ad.id
    WHERE ad.version = 'old'
      AND nv.id is null
)

SELECT id, name, country, version FROM new_version
UNION ALL
SELECT id, name, country, version FROM old_version
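The rule this SQL encodes (per id, prefer the new row, otherwise fall back to the old one) can be sketched on plain Scala collections, independent of Spark. The Record case class here is just an illustration, not part of the question:

```scala
// Plain-Scala sketch of the dedup rule: for each id, keep the "new"
// row if one exists, otherwise fall back to the remaining "old" row.
case class Record(id: String, name: String, country: String, version: String)

val rows = List(
  Record("1", "Jack", "UK", "new"),
  Record("1", "Jack", "USA", "old"),
  Record("2", "Rose", "Germany", "new"),
  Record("3", "Sam", "France", "old")
)

val deduped = rows
  .groupBy(_.id)                   // bucket rows by id
  .values
  .map(group => group.find(_.version == "new").getOrElse(group.head))
  .toList
  .sortBy(_.id)                    // deterministic output order

deduped.foreach(println)
```

The Spark answers below express the same preference, but distributed: either by ranking within a window partitioned on id, or by sorting before a groupBy.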

2 answers:

Answer 0 (score: 1)

Suppose you have the following dataframe

+---+----+-------+-------+
|id |Name|Country|version|
+---+----+-------+-------+
|1  |Jack|UK     |new    |
|1  |Jack|USA    |old    |
|2  |Rose|Germany|new    |
|3  |Sam |France |old    |
+---+----+-------+-------+

created using
val df = Seq(
  ("1","Jack","UK","new"),
  ("1","Jack","USA","old"),
  ("2","Rose","Germany","new"),
  ("3","Sam","France","old")
).toDF("id","Name","Country","version")

You can meet the requirement of your SQL query (dropping every duplicated id row whose version is old) by using the Window, rank, filter and drop functions, as follows:

import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("id").orderBy("version")
import org.apache.spark.sql.functions._
df.withColumn("rank", rank().over(windowSpec))
  .filter(!(col("version") === "old" && col("rank") > 1))
  .drop("rank")
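Since the question asks for Java, the Scala window solution above translates almost one-to-one. A rough sketch, assuming df is the Dataset&lt;Row&gt; built from the question's data and Spark is on the classpath (untested here):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;

// Same logic as the Scala version: rank rows per id by version
// ("new" sorts before "old"), then drop old rows that are outranked.
WindowSpec windowSpec = Window.partitionBy("id").orderBy("version");

Dataset<Row> result = df
    .withColumn("rank", rank().over(windowSpec))
    .filter(not(col("version").equalTo("old").and(col("rank").gt(1))))
    .drop("rank");
```

Note the ordering relies on "new" sorting alphabetically before "old"; with other version labels you would need an explicit ordering expression.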

You should get the following final dataframe:

+---+----+-------+-------+
|id |Name|Country|version|
+---+----+-------+-------+
|3  |Sam |France |old    |
|1  |Jack|UK     |new    |
|2  |Rose|Germany|new    |
+---+----+-------+-------+

Answer 1 (score: 1)

For older versions of Spark, you can use orderBy in combination with groupBy. According to the answers to this question, the ordering should be preserved after a groupBy if the dataframe was sorted on that column beforehand. Hence, the following should work (note the orderBy on both the id and version columns):

val df2 = df.orderBy("id", "version")
  .groupBy("id")
  .agg(first("Name").as("Name"), first("Country").as("Country"), first("version").as("version"))

This will give the following result:

+---+----+-------+-------+
| id|Name|Country|version|
+---+----+-------+-------+
|  3| Sam| France|    old|
|  1|Jack|     UK|    new|
|  2|Rose|Germany|    new|
+---+----+-------+-------+