I am currently parsing a very large JSON document with a nested structure that contains inner structs and arrays. I get a DataFrame that initially looks like this:
val salesRawDf: DataFrame = sqlContext.read.json(salesRDD)
+------------------+---------------+
| Sales | Stores |
+------------------+---------------+
|[[[[null,null,....|[Store1... |
+------------------+---------------+
It has a crazy schema:
root
|-- sales: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- current_sales: struct (nullable = true)
| | | |-- sales_metrics: struct (nullable = true)
| | | | |-- currency: string (nullable = true)
| | | | |-- metric1: string (nullable = true)
| | | | |-- metric2: string (nullable = true)
| | | | |-- info: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- amount: long (nullable = true)
| | | | | | |-- metric3: string (nullable = true)
| | | | | | |-- metric4: string (nullable = true)
| | | | |-- price: string (nullable = true)
| | | | |-- metric3: string (nullable = true)
So I tried to flatten it:
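import org.apache.spark.sql.functions.explode
import sqlContext.implicits._ // enables the $"..." column syntax
val salesDf: DataFrame = salesRawDf.select($"stores", explode($"sales").as("sl"))
.select($"stores.id", $"stores.name", $"sl.id", $"sl.current_sales")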
salesDf still does not have current_sales flattened, so I implemented a method that flattens the schema further, recursively.
import org.apache.spark.sql.types.{StructField, StructType}
def flattenFields(parent: String, schema: StructType): Seq[String] = schema.fields.flatMap {
case StructField(name, innerStruct: StructType, _, _) => flattenFields(parent + name + ".", innerStruct)
case StructField(name, _, _, _) => Seq(s"$parent$name")
}
val theFields: Seq[String] = flattenFields("", salesDf.schema)
val widesalesDF: DataFrame = salesDf.select(theFields.map(name => $"$name" as name): _*)
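After that, the schema of widesalesDF is almost flat, except for arrays such as current_sales.sales_metrics.info. So I am stuck on how to handle these arrays. Is there a way to extend the flattenFields method so that it also flattens arrays inside structs?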
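To make the goal concrete, here is a minimal sketch of the direction I have in mind, not a verified solution. It assumes Spark 2.2 or later (for explode_outer), and the helper name flattenAll is hypothetical, not part of Spark. The idea is to alternate two steps until neither applies: explode one top-level array column into one row per element, then expand one layer of struct columns into parent_child columns.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Hypothetical helper (not a Spark API): repeats "explode one array" and
// "expand one layer of structs" until no array or struct column remains.
def flattenAll(df: DataFrame): DataFrame = {
  val fields = df.schema.fields
  fields.find(_.dataType.isInstanceOf[ArrayType]) match {
    case Some(arr) =>
      // One output row per array element; explode_outer keeps rows whose
      // array is null or empty (the exploded value is then null).
      flattenAll(df.withColumn(arr.name, explode_outer(col(s"`${arr.name}`"))))
    case None if fields.exists(_.dataType.isInstanceOf[StructType]) =>
      // Expand one layer of struct columns; nested names are joined with "_"
      // so the resulting column names contain no dots.
      val cols = fields.flatMap { f =>
        f.dataType match {
          case st: StructType =>
            st.fieldNames.map(n => col(s"`${f.name}`.`$n`").as(s"${f.name}_$n")).toSeq
          case _ => Seq(col(s"`${f.name}`"))
        }
      }
      flattenAll(df.select(cols: _*))
    case None => df // nothing left to flatten
  }
}
```

Two caveats with this sketch: exploding several sibling arrays multiplies the rows (a cross product of their elements), and joining names with "_" can produce duplicate column names if the original fields already use that pattern. Something like flattenAll(salesDf) is what I would hope to end up with.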