I have a Spark dataset like this:
(a, ([1 a a1 a2 a3], [2 a a4 a5 a6]) ),
(b, ([3 b b1 b2 b3], [4 b b4 b5 b6], [5 b b7 b8 b9]) ),
(c, ([6 c c1 c2 c3]) )
I would like to group all rows by id into a list or array:
{{1}}
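I used map to emit key/value pairs with the right key, but I am having trouble building the final key/array structure.
Can anyone help?
Answer 0: (score: 5)
How about this: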
import org.apache.spark.sql.functions._
// The $"..." syntax needs import spark.implicits._ (spark being the active SparkSession)
df.withColumn("combined", array("key", "id", "val1", "val2", "val3")).groupBy("id").agg(collect_list($"combined"))
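For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rests on several assumptions: the column names are the ones used in the snippet above, the sample values are taken from the listing in the question, `key` is stored as a string (since `array` needs all its input columns to share one type), and the object name and local master are purely illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, collect_list}

// Hypothetical driver object, for local experimentation only.
object GroupRowsById {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("group-rows-by-id")
      .master("local[*]") // assumption: running locally
      .getOrCreate()
    import spark.implicits._ // enables toDF on local collections

    // Assumed input: one row per (key, id, val1, val2, val3), all strings.
    val df = Seq(
      ("1", "a", "a1", "a2", "a3"),
      ("2", "a", "a4", "a5", "a6"),
      ("3", "b", "b1", "b2", "b3"),
      ("4", "b", "b4", "b5", "b6"),
      ("5", "b", "b7", "b8", "b9"),
      ("6", "c", "c1", "c2", "c3")
    ).toDF("key", "id", "val1", "val2", "val3")

    val grouped = df
      // Pack each row's columns into a single array column.
      .withColumn("combined", array("key", "id", "val1", "val2", "val3"))
      // Collect the packed rows of every id into one list.
      .groupBy("id")
      .agg(collect_list(col("combined")).as("rows"))

    grouped.show(false) // print without truncating the nested arrays
    spark.stop()
  }
}
```

Note that `collect_list` makes no promise about the order of elements inside each list, so if row order matters it would have to be handled separately, for example by sorting the collected array afterwards.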
The array function combines the listed columns into a single array column; after that it is a plain groupBy with collect_list.
Answer 1: (score: 0)
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._
// Note: VectorAssembler only accepts numeric, boolean, or vector input columns
val assembler = new VectorAssembler()
  .setInputCols(Array("key", "id", "val1", "val2", "val3", "score"))
  .setOutputCol("combined")
// The $"..." syntax needs import spark.implicits._
val dfRes = assembler.transform(df).groupBy("id").agg(collect_list($"combined"))
Answer 2: (score: 0)