Methodically removing duplicate rows from a Spark Dataset

Time: 2017-07-13 12:36:09

Tags: apache-spark apache-spark-sql

I would like to achieve the following:

val df1 = Seq(("a","x",20),("z","x",10),("b","y",7),("z","y",5),("c","w",1),("z","w",2)).toDS


+---+---+---+
| _1| _2| _3|
+---+---+---+
|  a|  x| 20|
|  z|  x| 10|
|  b|  y|  7|
|  z|  y|  5|
|  c|  w|  1|
|  z|  w|  2|
+---+---+---+

should be reduced to

val df2 = Seq(("a","x",30),("b","y",12),("c","w",3)).toDS

+---+---+---+
| _1| _2| _3|
+---+---+---+
|  a|  x| 30|
|  b|  y| 12|
|  c|  w|  3|
+---+---+---+

I am aware of the dropDuplicates() method and its options, but it does not do what I need here. Somehow, duplicates have to be detected based on column _2; the row whose _1 entry is "z" must always be the one dropped, and its _3 value must be added to the _3 value of the row that is kept.
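For illustration only, a plain dropDuplicates attempt on _2 (a hypothetical sketch, not what I want) keeps an arbitrary row per key and simply discards the other _3 values instead of summing them:

// Hypothetical attempt: deduplicates on _2, but which row survives per key is
// arbitrary, and the other rows' _3 values are thrown away rather than summed.
val notWhatIWant = df1.dropDuplicates("_2")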

Thanks in advance.

1 answer:

Answer 0 (score: 0)

Based on your question, this is what you are looking for:

import spark.implicits._
import org.apache.spark.sql.functions._

val df1 = Seq(("a","x",20),("z","x",10),("b","y",7),("z","y",5),("c","w",1),("z","w",2)).toDS

// group on _2, take the first collected _1 value per group, and sum the _3 values
val resultDf = df1.groupBy("_2").agg(collect_list("_1")(0).as("_1"), sum("_3").as("_3"))

Output:

+---+---+---+
| _2| _1| _3|
+---+---+---+
|  x|  a| 30|
|  w|  c|  3|
|  y|  b| 12|
+---+---+---+

You will get the desired result, but the row order is not guaranteed.
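As a side note, the value picked by collect_list("_1")(0) is also not guaranteed to be the non-"z" entry, since the order in which values are collected within a group is not deterministic. A more explicit variant, sketched below under the assumption that "z" is always the marker row to be merged away, keeps the non-"z" key directly:

import org.apache.spark.sql.functions._

// Keep the non-"z" _1 per group explicitly: `when` yields null for the "z"
// rows and `max` ignores nulls, so the other key survives; _3 is summed.
val resultDf2 = df1
  .groupBy("_2")
  .agg(
    max(when(col("_1") =!= "z", col("_1"))).as("_1"),
    sum("_3").as("_3")
  )
  .select("_1", "_2", "_3")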