I want to achieve the following:
val df1 = Seq(("a","x",20),("z","x",10),("b","y",7),("z","y",5),("c","w",1),("z","w",2)).toDS
+---+---+---+
| _1| _2| _3|
+---+---+---+
| a| x| 20|
| z| x| 10|
| b| y| 7|
| z| y| 5|
| c| w| 1|
| z| w| 2|
+---+---+---+
which should be reduced to:
val df2 = Seq(("a","x",30),("b","y",12),("c","w",3)).toDS
+---+---+---+
| _1| _2| _3|
+---+---+---+
| a| x| 30|
| b| y| 12|
| c| w| 3|
+---+---+---+
I know about the dropDuplicates() command and its options. However, it doesn't do what I want here. Somehow, duplicates must be detected based on column _2; then the row whose _1 entry is "z" must always be dropped, and its _3 value added to the _3 value of the row that is kept.
Thanks in advance.
Answer 0 (score: 0)
Based on your question, this is what you are looking for:
import spark.implicits._
import org.apache.spark.sql.functions._  // needed for collect_list and sum
val df1 = Seq(("a","x",20),("z","x",10),("b","y",7),("z","y",5),("c","w",1),("z","w",2)).toDS
val resultDf = df1.groupBy("_2").agg(collect_list("_1")(0).as("_1"), sum("_3").as("_3"))
Output:
+---+---+---+
| _2| _1| _3|
+---+---+---+
| x| a| 30|
| w| c| 3|
| y| b| 12|
+---+---+---+
This gives you the result, but the ordering is not guaranteed.
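Note that `collect_list("_1")(0)` takes the first element of the collected list, and the order in which elements are collected within a group is nondeterministic, so it could in principle return "z". A sketch of a variant that avoids this by selecting the non-"z" key explicitly (this assumes each _2 group contains exactly one row whose _1 is not "z"):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq(("a","x",20),("z","x",10),("b","y",7),("z","y",5),("c","w",1),("z","w",2)).toDS

val resultDf = df1
  .groupBy("_2")
  .agg(
    // when(...) yields null for the "z" rows; max ignores nulls,
    // so this picks the single non-"z" key in each group
    max(when(col("_1") =!= "z", col("_1"))).as("_1"),
    sum("_3").as("_3")
  )
  .select("_1", "_2", "_3")
```

The `select` at the end only restores the original column order (_1, _2, _3); the row order is still not guaranteed after a groupBy.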