Converting flattened JSON to a nested structure

Date: 2018-03-25 13:44:28

Tags: scala apache-spark

I have a flat file containing the following information, already flattened into this shape:
A    B     C
5    1     [1,2.....10]
5    1     [11,12,13]
5    2     [1,2,3,15,16]
6    1     [1,2,3]
7    3     [4,5,6,7]

I need to aggregate on column B based on column C containing the same values, and convert this structure into the format

explode(arraySlice(col(C), lit(0), lit(10)))

I am using Scala on Spark. How can I do this?

1 answer:

Answer 0 (score: 1)

Given the dataframe as

+---+---+-------------------------------+
|A  |B  |C                              |
+---+---+-------------------------------+
|5  |1  |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|5  |1  |[11, 12, 13]                   |
|5  |2  |[1, 2, 3, 15, 16]              |
|6  |1  |[1, 2, 3]                      |
|7  |3  |[4, 5, 6, 7]                   |
+---+---+-------------------------------+
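For reference, a dataframe of this shape can be built locally; the following is a sketch that assumes a local `SparkSession` (the session name `spark` and the `local[*]` master are assumptions for testing, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimenting; in a real job the session
// would typically be provided by the cluster environment.
val spark = SparkSession.builder()
  .appName("nest-flattened")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Recreate the example data: each row has scalar A and B
// and an array column C.
val df = Seq(
  (5, 1, Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)),
  (5, 1, Seq(11, 12, 13)),
  (5, 2, Seq(1, 2, 3, 15, 16)),
  (6, 1, Seq(1, 2, 3)),
  (7, 3, Seq(4, 5, 6, 7))
).toDF("A", "B", "C")

df.show(false)
```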

you can explode and aggregate as follows

import org.apache.spark.sql.functions._

df.withColumn("C", explode(col("C")))   // one output row per element of array C
  .groupBy("A", "C")                    // group rows sharing the same A and C value
  .agg(collect_list("B").as("B"))       // gather the matching B values into a list
  .select("A", "B", "C")
  .show(false)

and you should get the output you want

+---+------+---+
|A  |B     |C  |
+---+------+---+
|5  |[2]   |16 |
|6  |[1]   |1  |
|7  |[3]   |4  |
|5  |[1]   |7  |
|5  |[1]   |6  |
|5  |[1]   |4  |
|5  |[1]   |12 |
|5  |[1]   |13 |
|6  |[1]   |3  |
|7  |[3]   |5  |
|5  |[1]   |10 |
|6  |[1]   |2  |
|5  |[1]   |8  |
|5  |[2]   |15 |
|5  |[1, 2]|2  |
|5  |[1, 2]|1  |
|5  |[1, 2]|3  |
|7  |[3]   |7  |
|5  |[1]   |9  |
|5  |[1]   |11 |
+---+------+---+
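Note that `collect_list` keeps duplicates and does not guarantee any ordering of the collected values across partitions. If a deduplicated, sorted list of B values is preferable, `collect_set` combined with `sort_array` (both in `org.apache.spark.sql.functions`) can be substituted; this variant is a sketch, not part of the original answer:

```scala
import org.apache.spark.sql.functions._

// collect_set drops duplicate B values within each (A, C) group;
// sort_array then gives the resulting array a deterministic order.
df.withColumn("C", explode(col("C")))
  .groupBy("A", "C")
  .agg(sort_array(collect_set("B")).as("B"))
  .select("A", "B", "C")
  .show(false)
```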