拼合具有嵌套数组和StructType Spark Scala的Parquet文件

时间:2019-03-20 20:18:10

标签: scala apache-spark apache-spark-sql parquet flatten

我希望通过Scala动态地展平Spark中的镶木地板文件。我想知道实现这一目标的有效方法。

实木复合地板文件在多个深度级别包含多个“数组”和“结构类型嵌套”。实木复合地板文件架构将来可能会更改,因此我无法对任何属性进行硬编码。所需的最终结果是平整的定界文件。

是否将使用平面图并递归分解工作的解决方案?

示例架构:

|-- exCar: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- exCarOne: string (nullable = true)
 |    |    |-- exCarTwo: string (nullable = true)
 |    |    |-- exCarThree: string (nullable = true)
 |-- exProduct: string (nullable = true)
 |-- exName: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- exNameOne: string (nullable = true)
 |    |    |-- exNameTwo: string (nullable = true)
 |    |    |-- exNameThree: string (nullable = true)
 |    |    |-- exNameFour: string (nullable = true)
 |    |    |-- exNameCode: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- exNameCodeOne: string (nullable = true)
 |    |    |    |    |-- exNameCodeTwo: string (nullable = true)
 |    |    |-- exColor: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- exColorOne: string (nullable = true)
 |    |    |    |    |-- exColorTwo: string (nullable = true)
 |    |    |    |    |-- exWheelColor: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- exWheelColorOne: string (nullable = true)
 |    |    |    |    |    |    |-- exWheelColorTwo: string (nullable = true)
 |    |    |    |    |    |    |--exWheelColorThree: string (nullable =true)
 |    |    |-- exGlass: string (nullable = true)
 |-- exDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- exBill: string (nullable = true)
 |    |    |-- exAccount: string (nullable = true)
 |    |    |-- exLoan: string (nullable = true)
 |    |    |-- exRate: string (nullable = true)

所需的输出模式:

 exCar.exCarOne
 exCar.exCarTwo
 exCar.exCarThree
 exProduct
 exName.exNameOne
 exName.exNameTwo
 exName.exNameThree
 exName.exNameFour
 exName.exNameCode.exNameCodeOne
 exName.exNameCode.exNameCodeTwo
 exName.exColor.exColorOne
 exName.exColor.exColorTwo
 exName.exColor.exWheelColor.exWheelColorOne
 exName.exColor.exWheelColor.exWheelColorTwo
 exName.exColor.exWheelColor.exWheelColorThree
 exName.exGlass
 exDetails.exBill
 exDetails.exAccount
 exDetails.exLoan
 exDetails.exRate

1 个答案:

答案 0 :(得分:0)

需要完成两件事:

1)将数组列从最外面的嵌套数组分解为位于内部的数组:展开exName(给您许多包含exColor的json行),然后exColor然后爆炸,即可访问exWheelColor等。

2)将嵌套的json投影到单独的列中。