Scala - How to explode a struct type with nested arrays in a DataFrame?

Asked: 2018-07-31 13:16:54

Tags: json scala apache-spark apache-spark-sql

I am currently parsing a huge JSON document that has nested structs containing inner structs and arrays. I start with a DataFrame that initially looks like this:

val salesRawDf: DataFrame = sqlContext.read.json(salesRDD)

+------------------+---------------+
|      Sales       |    Stores     |
+------------------+---------------+
|[[[[null,null,....|[Store1...     |
+------------------+---------------+

It has a crazy schema:

root
 |-- sales: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- current_sales: struct (nullable = true)
 |    |    |    |-- sales_metrics: struct (nullable = true)
 |    |    |    |    |-- currency: string (nullable = true)
 |    |    |    |    |-- metric1: string (nullable = true)
 |    |    |    |    |-- metric2: string (nullable = true)
 |    |    |    |    |-- info: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- amount: long (nullable = true)
 |    |    |    |    |    |    |-- metric3: string (nullable = true)
 |    |    |    |    |    |    |-- metric4: string (nullable = true)
 |    |    |    |    |-- price: string (nullable = true)
 |    |    |    |    |-- metric3: string (nullable = true)

So I tried to flatten it:

val salesDf: DataFrame = salesRawDf.select($"stores", explode($"sales").as("sl"))
  .select($"stores.id", $"stores.name", $"sl.id", $"sl.current_sales")

salesDf still leaves current_sales unflattened, so I implemented a method to flatten the schema further, recursively:

def flattenFields(parent: String, schema: StructType): Seq[String] = schema.fields.flatMap {
  case StructField(name, innerStruct: StructType, _, _) => flattenFields(parent + name + ".", innerStruct)
  case StructField(name, _, _, _) => Seq(s"$parent$name")
}
val theFields: Seq[String] = flattenFields("", salesDf.schema)

val widesalesDF: DataFrame = salesDf.select(theFields.map(name => $"$name".as(name)): _*)

After that, widesalesDF's schema is almost completely flat, except for arrays such as current_sales.sales_metrics.info. So I am stuck on how to handle those arrays. Is there a way to extend the flattenFields method to also flatten arrays inside structs?
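For context, here is a sketch of one direction I am considering (not a verified solution): explode every ArrayType column with explode_outer, flatten every StructType column by prefixing child names, and repeat until the schema is flat. This assumes Spark 2.2+ for explode_outer; the name flattenAll is hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Hypothetical helper: repeatedly rewrite the DataFrame until no
// ArrayType or StructType columns remain at the top level.
def flattenAll(df: DataFrame): DataFrame = {
  val fields = df.schema.fields
  fields.collectFirst {
    case f if f.dataType.isInstanceOf[ArrayType] =>
      // explode_outer keeps rows whose array is null or empty
      df.withColumn(f.name, explode_outer(col(f.name)))
    case f if f.dataType.isInstanceOf[StructType] =>
      // pull each struct child up one level, prefixed to avoid name clashes
      val inner = f.dataType.asInstanceOf[StructType].fields.map { sf =>
        col(s"${f.name}.${sf.name}").as(s"${f.name}_${sf.name}")
      }
      val others = fields.filterNot(_.name == f.name).map(x => col(x.name))
      df.select(others ++ inner: _*)
  } match {
    case Some(step) => flattenAll(step) // each pass removes one nesting level
    case None       => df               // schema is flat
  }
}
```

Exploding an array of structs produces a struct column, which the next pass flattens, so deeply nested combinations like current_sales.sales_metrics.info should eventually unravel; note that each explode multiplies rows.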

0 Answers:

No answers