I have a DataFrame that contains the following data.
|id|Name|Country|version|
|1 |Jack|UK |new |
|1 |Jack|USA |old |
|2 |Rose|Germany|new |
|3 |Sam |France |old |
I want to create a DataFrame in which, whenever rows are duplicated on "id", the new version is kept instead of the old version, like this:
|id|Name|Country|version|
|1 |Jack|UK |new |
|2 |Rose|Germany|new |
|3 |Sam |France |old |
What is the best way to do this in Java/Spark, or do I have to use some kind of nested SQL query?
A simplified SQL version would look like this:
WITH new_version AS (
SELECT
ad.id
,ad.name
,ad.country
,ad.version
FROM allData ad
WHERE ad.version = 'new'
),
old_version AS (
SELECT
ad.id
,ad.name
,ad.country
,ad.version
FROM allData ad
LEFT JOIN new_version nv on nv.id = ad.id
WHERE ad.version = 'old'
AND nv.id is null
)
SELECT id, name, country, version FROM new_version
UNION ALL
SELECT id, name, country, version FROM old_version
Answer 0: (score: 1)
Suppose you have the dataframe
+---+----+-------+-------+
|id |Name|Country|version|
+---+----+-------+-------+
|1 |Jack|UK |new |
|1 |Jack|USA |old |
|2 |Rose|Germany|new |
|3 |Sam |France |old |
+---+----+-------+-------+
created using
val df = Seq(
("1","Jack","UK","new"),
("1","Jack","USA","old"),
("2","Rose","Germany","new"),
("3","Sam","France","old")
).toDF("id","Name","Country","version")
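You can do what your SQL query does, that is, drop the old-version row for every id that also has a new-version row, by using the Window, rank, filter and drop functions as shown below. Because "new" sorts before "old", ordering each id partition by version gives the new row rank 1.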
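import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// "new" sorts before "old", so within each id the new row gets rank 1
val windowSpec = Window.partitionBy("id").orderBy("version")
// drop an old row only when another row for the same id ranks ahead of it
df.withColumn("rank", rank().over(windowSpec))
.filter(!(col("version") === "old" && col("rank") > 1))
.drop("rank")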
You should get the final dataframe
+---+----+-------+-------+
|id |Name|Country|version|
+---+----+-------+-------+
|3 |Sam |France |old |
|1 |Jack|UK |new |
|2 |Rose|Germany|new |
+---+----+-------+-------+
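One thing to watch with rank: tied rows share a rank, so an id with two "new" rows (or two "old" rows and no "new" one) keeps both. If exactly one row per id is needed, row_number can be swapped in. The following is only a sketch under some assumptions: the object name KeepNewestPerId, the helper keepNewest and the local SparkSession are made up for illustration, and when two rows tie on version, which one survives is arbitrary.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical helper object, not part of the original answer.
object KeepNewestPerId {

  // Keeps exactly one row per id, preferring version "new" over "old".
  def keepNewest(df: DataFrame): DataFrame = {
    // "new" sorts before "old", so ascending order puts the new row first.
    val byId = Window.partitionBy("id").orderBy(col("version"))
    df.withColumn("rn", row_number().over(byId))
      .filter(col("rn") === 1) // row_number is unique within a partition
      .drop("rn")
  }

  def main(args: Array[String]): Unit = {
    // Local session for trying it out; assumed, adjust for a real cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("keep-newest-per-id")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("1", "Jack", "UK", "new"),
      ("1", "Jack", "USA", "old"),
      ("2", "Rose", "Germany", "new"),
      ("3", "Sam", "France", "old")
    ).toDF("id", "Name", "Country", "version")

    keepNewest(df).show()
    spark.stop()
  }
}
```

In spark-shell only the body of keepNewest is needed, since spark and its implicits are already in scope.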
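Answer 1: (score: 1)
For older versions of Spark, you can combine orderBy with groupBy. According to the answer to this question, the order should be preserved after groupBy if the dataframe is sorted by that column. So the following should work (note that the orderBy is on both the id and version columns):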
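import org.apache.spark.sql.functions.first
// sort so that "new" comes before "old" within each id, then take the first row per id
val df2 = df.orderBy("id", "version")
.groupBy("id")
.agg(first("Name").as("Name"), first("Country").as("Country"), first("version").as("version"))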
This gives the following result
+---+----+-------+-------+
| id|Name|Country|version|
+---+----+-------+-------+
| 3| Sam| France| old|
| 1|Jack| UK| new|
| 2|Rose|Germany| new|
+---+----+-------+-------+
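Since first depends on row order, and the Spark documentation describes it as non-deterministic after a shuffle, an alternative that does not rely on ordering at all is to aggregate with min over a struct whose first field is version. Structs compare field by field, so the row with the smallest version ("new" before "old") wins. This is a sketch of that different technique, not the method from the answer above; the object name KeepNewestByAgg, the helper keepNewest and the column alias best are invented for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, min, struct}

// Hypothetical helper object, not part of the original answer.
object KeepNewestByAgg {

  // One row per id: the row whose (version, Name, Country) struct is smallest.
  def keepNewest(df: DataFrame): DataFrame =
    df.groupBy("id")
      // version is the first struct field, so it decides the comparison;
      // Name and Country only break ties between rows with the same version.
      .agg(min(struct(col("version"), col("Name"), col("Country"))).as("best"))
      .select(
        col("id"),
        col("best.Name").as("Name"),
        col("best.Country").as("Country"),
        col("best.version").as("version")
      )

  def main(args: Array[String]): Unit = {
    // Local session for trying it out; assumed, adjust for a real cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("keep-newest-by-agg")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("1", "Jack", "UK", "new"),
      ("1", "Jack", "USA", "old"),
      ("2", "Rose", "Germany", "new"),
      ("3", "Sam", "France", "old")
    ).toDF("id", "Name", "Country", "version")

    keepNewest(df).show()
    spark.stop()
  }
}
```

Unlike the first-based version, all columns of the result come from the same input row, because the whole struct is selected at once rather than column by column.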