I have a flat file containing the following information:
A B C
5 1 [1,2.....10]
5 1 [11,12,13]
5 2 [1,2,3,15,16]
6 1 [1,2,3]
7 3 [4,5,6,7]
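I need to aggregate column B over the rows whose column C contains the same value, so that the result has one row per value of C (within each A) together with the B values it appears with.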
I am using Scala on Spark. How can I do this?
Answer 0 (score: 1)
Assuming the dataframe is
+---+---+-------------------------------+
|A |B |C |
+---+---+-------------------------------+
|5 |1 |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|5 |1 |[11, 12, 13] |
|5 |2 |[1, 2, 3, 15, 16] |
|6 |1 |[1, 2, 3] |
|7 |3 |[4, 5, 6, 7] |
+---+---+-------------------------------+
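You can explode and aggregate as follows:
import org.apache.spark.sql.functions._

// Explode C into one row per array element, then collect the B values for each (A, C) pair
df.withColumn("C", explode(col("C")))
  .groupBy("A", "C")
  .agg(collect_list("B").as("B"))
  .select("A", "B", "C")
  .show(false)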
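You should get your desired output. Note that show displays only the first 20 rows by default, so two of the 22 groups are not listed here, and the row order is not guaranteed: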
+---+------+---+
|A |B |C |
+---+------+---+
|5 |[2] |16 |
|6 |[1] |1 |
|7 |[3] |4 |
|5 |[1] |7 |
|5 |[1] |6 |
|5 |[1] |4 |
|5 |[1] |12 |
|5 |[1] |13 |
|6 |[1] |3 |
|7 |[3] |5 |
|5 |[1] |10 |
|6 |[1] |2 |
|5 |[1] |8 |
|5 |[2] |15 |
|5 |[1, 2]|2 |
|5 |[1, 2]|1 |
|5 |[1, 2]|3 |
|7 |[3] |7 |
|5 |[1] |9 |
|5 |[1] |11 |
+---+------+---+
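
To make the explode-then-group logic easier to follow, here is a minimal sketch of the same steps on plain Scala collections, without Spark. The `Row` case class and the object name are hypothetical and only mirror columns A, B and C from the question. This is not how Spark executes the query. In particular, this sketch keeps the B values in input order, whereas Spark's collect_list does not guarantee any order.

```scala
object ExplodeAggregateSketch {
  // Hypothetical row type mirroring columns A, B and C of the question.
  final case class Row(a: Int, b: Int, c: Seq[Int])

  val rows: Seq[Row] = Seq(
    Row(5, 1, 1 to 10),
    Row(5, 1, Seq(11, 12, 13)),
    Row(5, 2, Seq(1, 2, 3, 15, 16)),
    Row(6, 1, Seq(1, 2, 3)),
    Row(7, 3, Seq(4, 5, 6, 7))
  )

  // "explode": emit one (A, B, element of C) triple per array element.
  val exploded: Seq[(Int, Int, Int)] =
    for { r <- rows; c <- r.c } yield (r.a, r.b, c)

  // "groupBy(A, C)" followed by "collect_list(B)": gather the B values per (A, C) key.
  val aggregated: Map[(Int, Int), Seq[Int]] =
    exploded
      .groupBy { case (a, _, c) => (a, c) }
      .map { case (key, group) => key -> group.map(_._2) }

  def main(args: Array[String]): Unit =
    aggregated.toSeq.sortBy(_._1).foreach { case ((a, c), bs) =>
      println(s"A=$a B=${bs.mkString("[", ", ", "]")} C=$c")
    }
}
```

Like collect_list, this keeps duplicate B values. If each B should appear at most once per group, collect_set is the Spark function to consider instead.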