Convert a flattened DataFrame back to structs in Spark

Time: 2019-10-29 10:24:52

Tags: scala dataframe apache-spark

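I have a deeply nested JSON file that I have to process. To do that I had to flatten it, because I could not find a way to hash some of the deeply nested fields. This is what my dataframe looks like after flattening: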

scala> flattendedJSON.printSchema
root
 |-- header_appID: string (nullable = true)
 |-- header_appVersion: string (nullable = true)
 |-- header_userID: string (nullable = true)
 |-- body_cardId: string (nullable = true)
 |-- body_cardStatus: string (nullable = true)
 |-- body_cardType: string (nullable = true)
 |-- header_userAgent_browser: string (nullable = true)
 |-- header_userAgent_browserVersion: string (nullable = true)
 |-- header_userAgent_deviceName: string (nullable = true)
 |-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
 |-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)

I need to convert it back to the original structure (as it was before flattening):

scala> nestedJson.printSchema
root
 |-- header: struct (nullable = true)
 |    |-- appID: string (nullable = true)
 |    |-- appVersion: string (nullable = true)
 |    |-- userAgent: struct (nullable = true)
 |    |    |-- browser: string (nullable = true)
 |    |    |-- browserVersion: string (nullable = true)
 |    |    |-- deviceName: string (nullable = true)
 |-- body: struct (nullable = true)
 |    |-- beneficiary: struct (nullable = true)
 |    |    |-- beneficiaryAccounts: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- beneficiaryAccountOwner: string (nullable = true)
 |    |    |-- beneficiaryPhoneNumbers: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- beneficiaryPhoneNumber: string (nullable = true)
 |    |-- cardId: string (nullable = true)
 |    |-- cardStatus: string (nullable = true)
 |    |-- cardType: string (nullable = true)

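I have managed to do this for a single level of nesting, but it does not work with more levels, and I cannot find a proper way to do it. This is what I tried: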

val structColumns = flattendedJSON.columns.filter(_.contains("_"))
val structColumnsMap = structColumns.map(_.split("\\_"))
  .groupBy(_(0)).mapValues(_.map(_(1)))

val dfExpanded = structColumnsMap.foldLeft(flattendedJSON) { (accDF, kv) =>
  val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
  accDF.withColumn(kv._1, struct(cols: _*))
}
val dfResult = structColumns.foldLeft(flattendedJSON)(_ drop _)

This works fine when there is a single level of nesting (e.g. header_appID), but for header_userAgent_browser I get an exception:

  

org.apache.spark.sql.AnalysisException: cannot resolve 'header_userAgent' given input columns: ..

I am using Spark 2.3 and Scala 2.11.8.

1 answer:

Answer 0 (score: 0)

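I would suggest using case classes together with a Dataset instead of flattening the DF and then trying to convert it back to the old JSON format. Even if the data has nested objects, you can define a set of case classes to cast it to. That lets you use object notation, which makes things easier than working with a DF.

There are tools where you can supply a sample of the JSON and have the classes generated for you (I use https://json2caseclass.cleverapps.io).

Anyway, if you want to convert from a DF, one option is to cast it to a Dataset and build the nested result with map. Something like this: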

import spark.implicits._ // assumes a SparkSession named `spark` is in scope, as in spark-shell

case class NestedNode(fieldC: String, fieldD: String)      // nested part of the JSON
case class MainNode(fieldA: String, fieldB: NestedNode)    // top-level JSON object
case class FlattenData(fa: String, fc: String, fd: String) // one flat input row

val yourDS = Seq(
  FlattenData("A1", "B1", "C1"),
  FlattenData("A2", "B2", "C2"),
  FlattenData("A3", "B3", "C3")
).toDF
 .as[FlattenData] // cast to a typed Dataset to access fields with object notation
 .map(flattenItem =>
    MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd)) // build the output format
  )
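
Applied to part of the schema in the question, the same idea might look like the sketch below. It only covers the header fields; the class names (UserAgent, Header, Event, FlatEvent), the sample values and the local master setting are assumptions made for illustration, not something taken from the original data.

```scala
import org.apache.spark.sql.SparkSession

// Target (nested) shape, mirroring the header part of the schema in the question.
case class UserAgent(browser: String, browserVersion: String, deviceName: String)
case class Header(appID: String, appVersion: String, userAgent: UserAgent)
case class Event(header: Header)

// Flat shape: field names have to match the flat column names.
case class FlatEvent(
  header_appID: String,
  header_appVersion: String,
  header_userAgent_browser: String,
  header_userAgent_browserVersion: String,
  header_userAgent_deviceName: String)

// Assumption: a local session for trying this out; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("nest-with-case-classes").getOrCreate()
import spark.implicits._

// Made-up sample row standing in for the flattened data.
val flatDS = Seq(FlatEvent("app", "1.0", "firefox", "70", "laptop")).toDS()

// Rebuild the nesting by constructing the target classes field by field.
val nestedDS = flatDS.map { f =>
  Event(Header(
    f.header_appID,
    f.header_appVersion,
    UserAgent(f.header_userAgent_browser, f.header_userAgent_browserVersion, f.header_userAgent_deviceName)))
}

nestedDS.printSchema()
```

An existing flat DF could be cast with .as[FlatEvent] instead of building the Dataset from a Seq, provided its column names and types line up with the case class.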

Finally, the schema defined by the classes will be used by yourDS.write.mode(your_save_mode).json(your_target_path).
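
If you would rather stay in the DataFrame API, a different technique is to rebuild the structs recursively from the column names. The attempt in the question splits only one level deep, so for header_userAgent_browser it looks up a column called header_userAgent, which does not exist. The following is a minimal sketch of the recursive variant, not a tested solution for the real data. It assumes that "_" only ever separates nesting levels, that no column name is a prefix of another (for example "a" next to "a_b"), and that plain structs are acceptable: it does not restore the array fields (beneficiaryAccounts, beneficiaryPhoneNumbers) of the original schema. The names Unflatten, nest and unflatten, the sample row and the local master setting are made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

object Unflatten {
  // Each entry pairs a full flat column name with the path segments not yet consumed.
  def nest(cols: Seq[(String, List[String])]): Seq[Column] = {
    // Distinct next segments, in first-seen order, become the fields at this level.
    val keys = cols.map(_._2.head).distinct
    keys.map { key =>
      val group = cols.filter(_._2.head == key)
      group match {
        // Exactly one column whose path ends here: a leaf, renamed to its last segment.
        case Seq((full, _ :: Nil)) => col(s"`$full`").as(key)
        // Otherwise recurse on the remaining segments and wrap the result in a struct.
        case _ => struct(nest(group.map { case (full, path) => (full, path.tail) }): _*).as(key)
      }
    }
  }

  // Split every column name on "_" and select the rebuilt top-level columns.
  def unflatten(df: DataFrame): DataFrame =
    df.select(nest(df.columns.toSeq.map(c => (c, c.split("_").toList))): _*)
}

// Assumption: a local session for trying this out; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("unflatten").getOrCreate()
import spark.implicits._

// Made-up sample row using a few of the flat column names from the question.
val flat = Seq(("app", "1.0", "firefox", "card-1"))
  .toDF("header_appID", "header_appVersion", "header_userAgent_browser", "body_cardId")

val nested = Unflatten.unflatten(flat)
nested.printSchema()
```

Because the result is built with a single select, the flat columns disappear on their own and no separate drop step is needed. In spark-shell the object definition may need to be entered with :paste.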