My DataFrame has two columns, with data as follows:
+----+-----------------+
|acct|           device|
+----+-----------------+
|   B|       List(3, 4)|
|   C|       List(3, 5)|
|   A|       List(2, 6)|
|   B|List(3, 11, 4, 9)|
|   C|       List(5, 6)|
|   A|List(2, 10, 7, 6)|
+----+-----------------+
I need the result to look like this:
+----+-----------------+
|acct|           device|
+----+-----------------+
|   B|List(3, 4, 11, 9)|
|   C|    List(3, 5, 6)|
|   A|List(2, 6, 7, 10)|
+----+-----------------+
I tried the following, but neither seems to work:
df.groupBy("acct").agg(concat("device"))
df.groupBy("acct").agg(collect_set("device"))
Please let me know how I can achieve this in Scala.
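Answer 0 (score: 1)
You can start by exploding the device column and then continue as you did. Note, however, that this may not preserve the order of the lists (which is not guaranteed within any group anyway):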
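import org.apache.spark.sql.functions.{collect_set, explode}
import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named spark

// One row per (acct, device value), then gather the distinct values per acct
val result = df.withColumn("device", explode($"device"))
  .groupBy("acct")
  .agg(collect_set("device"))
result.show(truncate = false)
// +----+-------------------+
// |acct|collect_set(device)|
// +----+-------------------+
// |B   |[9, 3, 4, 11]      |
// |C   |[5, 6, 3]          |
// |A   |[2, 6, 10, 7]      |
// +----+-------------------+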
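As a variation on the approach above that avoids the explode step, here is a minimal sketch that uses the built-in flatten and array_distinct functions instead. This is a different technique from the one in the answer, and it assumes Spark 2.4 or later, where both functions are available in org.apache.spark.sql.functions. The object name FlattenDistinctExample and the helper mergeDevices are made up for illustration, and the sample rows are copied from the question. As with collect_set, the order of elements in the result is not guaranteed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_distinct, col, collect_list, flatten}

// Hypothetical helper object, not part of the original answer.
object FlattenDistinctExample {

  // Collect all device arrays per acct, flatten them into one array,
  // then remove duplicate elements. Assumes Spark 2.4+.
  def mergeDevices(df: DataFrame): DataFrame =
    df.groupBy("acct")
      .agg(array_distinct(flatten(collect_list(col("device")))).as("device"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-distinct-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question
    val df = Seq(
      ("B", Seq(3, 4)),
      ("C", Seq(3, 5)),
      ("A", Seq(2, 6)),
      ("B", Seq(3, 11, 4, 9)),
      ("C", Seq(5, 6)),
      ("A", Seq(2, 10, 7, 6))
    ).toDF("acct", "device")

    mergeDevices(df).show(truncate = false)
    spark.stop()
  }
}
```

The result has one row per acct and a single array column named device, so it keeps the shape the question asks for.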
Answer 1 (score: 0)
You can try using collect_set together with a Window. In your case:
df.withColumn("device", collect_set("device").over(Window.partitionBy("acct")))
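One thing to keep in mind with the line above: a window function keeps every input row, and because device is already an array column, collect_set gathers whole arrays rather than their elements. A sketch of how the window idea could be adapted to get one merged list per acct is shown below. It is only an illustration under those assumptions, not the answerer's code: the object name WindowCollectSetSketch and the helper mergeDevices are hypothetical, and the explode and dropDuplicates steps are additions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, explode}

// Hypothetical helper object, not part of the original answer.
object WindowCollectSetSketch {

  def mergeDevices(df: DataFrame): DataFrame = {
    // No orderBy, so the frame covers the whole partition for each row
    val byAcct = Window.partitionBy("acct")

    df.withColumn("device", explode(col("device")))                  // one row per element
      .withColumn("device", collect_set(col("device")).over(byAcct)) // same set on every row of an acct
      .dropDuplicates("acct")                                        // keep a single row per acct
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window-collect-set-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question
    val df = Seq(
      ("B", Seq(3, 4)),
      ("C", Seq(3, 5)),
      ("A", Seq(2, 6)),
      ("B", Seq(3, 11, 4, 9)),
      ("C", Seq(5, 6)),
      ("A", Seq(2, 10, 7, 6))
    ).toDF("acct", "device")

    mergeDevices(df).show(truncate = false)
    spark.stop()
  }
}
```

Since the window version computes the set for every exploded row before discarding the duplicates, a plain groupBy is likely the cheaper choice when only one row per acct is needed.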
Answer 2 (score: 0)
Another option, possibly better than the explode approach: create your own UserDefinedAggregateFunction that merges the lists into a distinct set.
You have to extend UserDefinedAggregateFunction as follows:
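import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{ArrayType, DataType, IntegerType, StructField, StructType}
import scala.collection.mutable

class MergeListsUDAF extends UserDefinedAggregateFunction {
  // Input and buffer both hold a single array-of-int column
  override def inputSchema: StructType = StructType(Seq(StructField("a", ArrayType(IntegerType))))
  override def bufferSchema: StructType = inputSchema
  override def dataType: DataType = ArrayType(IntegerType)
  override def deterministic: Boolean = true
  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer.update(0, mutable.Seq[Int]())

  // Append the incoming list to the buffered one and drop duplicates
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val existing = buffer.getAs[mutable.Seq[Int]](0)
    val newList = input.getAs[mutable.Seq[Int]](0)
    val result = (existing ++ newList).distinct
    buffer.update(0, result)
  }

  // Merging two partial buffers is the same operation as update
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = update(buffer1, buffer2)
  override def evaluate(buffer: Row): Any = buffer.getAs[mutable.Seq[Int]](0)
}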
And use it like this:
val mergeUDAF = new MergeListsUDAF()
df.groupBy("acct").agg(mergeUDAF($"device"))
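To make the merge step of the UDAF easier to follow, here is a small plain-Scala model of it that runs without Spark. It only imitates the `(existing ++ newList).distinct` logic on ordinary collections. The names MergeListsModel, mergeDistinct, and mergeByKey are invented for this sketch, and the rows are the sample data from the question. It also assumes rows are processed in the order listed, which Spark itself does not guarantee.

```scala
// Hypothetical stand-alone model of the UDAF's update/merge logic; no Spark required.
object MergeListsModel {

  // Same expression as in MergeListsUDAF.update: append, then keep first occurrences only
  def mergeDistinct(existing: Seq[Int], incoming: Seq[Int]): Seq[Int] =
    (existing ++ incoming).distinct

  // Group rows by acct and fold each group's lists together, starting from an empty buffer
  def mergeByKey(rows: Seq[(String, Seq[Int])]): Map[String, Seq[Int]] =
    rows.groupBy(_._1).map { case (acct, group) =>
      acct -> group.map(_._2).foldLeft(Seq.empty[Int])(mergeDistinct)
    }

  def main(args: Array[String]): Unit = {
    // Sample data from the question
    val rows = Seq(
      ("B", Seq(3, 4)),
      ("C", Seq(3, 5)),
      ("A", Seq(2, 6)),
      ("B", Seq(3, 11, 4, 9)),
      ("C", Seq(5, 6)),
      ("A", Seq(2, 10, 7, 6))
    )

    mergeByKey(rows).foreach { case (acct, devices) =>
      println(s"$acct -> ${devices.mkString(", ")}")
    }
  }
}
```

With the rows taken in this order, acct B ends up as 3, 4, 11, 9, because distinct keeps the first occurrence of each value. Also note that on Spark 3.0 and later UserDefinedAggregateFunction is marked deprecated in favor of Aggregator, so the class above is best read as targeting Spark 2.x.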