I have a deeply nested JSON file that I need to process, and to do that I had to flatten it, because I couldn't find a way to hash some of the deeply nested fields. This is what my dataframe looks like (after flattening):
scala> flattendedJSON.printSchema
root
|-- header_appID: string (nullable = true)
|-- header_appVersion: string (nullable = true)
|-- header_userID: string (nullable = true)
|-- body_cardId: string (nullable = true)
|-- body_cardStatus: string (nullable = true)
|-- body_cardType: string (nullable = true)
|-- header_userAgent_browser: string (nullable = true)
|-- header_userAgent_browserVersion: string (nullable = true)
|-- header_userAgent_deviceName: string (nullable = true)
|-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
|-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)
I need to convert it back to the original structure (before flattening):
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountOwner: string (nullable = true)
| | |-- beneficiaryPhoneNumbers: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryPhoneNumber: string (nullable = true)
| |-- cardId: string (nullable = true)
| |-- cardStatus: string (nullable = true)
| |-- cardType: string (nullable = true)
I've managed to make it work with a single level of nesting, but it doesn't work with more, and I couldn't find a proper way to do it. This is what I've tried:
val structColumns = flattendedJSON.columns.filter(_.contains("_"))
val structColumnsMap = structColumns.map(_.split("\\_")).
  groupBy(_(0)).mapValues(_.map(_(1)))
val dfExpanded = structColumnsMap.foldLeft(flattendedJSON) { (accDF, kv) =>
  val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
  accDF.withColumn(kv._1, struct(cols: _*))
}
val dfResult = structColumns.foldLeft(flattendedJSON)(_ drop _)
It works fine if I have a single level of nesting (e.g. header_appID), but in the case of header_userAgent_browser I get an exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'header_userAgent' given input columns: ..

Using Spark 2.3 and Scala 2.11.8.
Answer 0 (score: 0)
I would suggest using case classes to work with a Dataset, instead of flattening the DF and then trying to convert it back to the old json format. Even if it has nested objects, you can define a set of case classes to cast it. It lets you work with object notation, which makes things easier than working with the DF.
There are tools where you can provide a sample of the json and they generate the classes for you (I use https://json2caseclass.cleverapps.io).
If you anyway want to convert from the DF, an alternative could be to create a Dataset using map on your DF. Something like this:
case class NestedNode(fieldC: String, fieldD: String) // for JSON
case class MainNode(fieldA: String, fieldB: NestedNode) // for JSON
case class FlattenData(fa: String, fc: String, fd: String)

Seq(
  FlattenData("A1", "B1", "C1"),
  FlattenData("A2", "B2", "C2"),
  FlattenData("A3", "B3", "C3")
).toDF
  .as[FlattenData] // Cast it to access with object notation
  .map(flattenItem => {
    MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd)) // Creating output format
  })
At the end, the schema defined by the case classes will be used by yourDS.write.mode(your_save_mode).json(your_target_path).
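As a sketch of an alternative that stays at the DataFrame level: the foldLeft in the question only rebuilds one level because it splits each column name into exactly two parts; grouping the name segments recursively can rebuild arbitrary depth. The helper nestColumns below is hypothetical, not part of the question's code, and note it rebuilds the beneficiary fields as plain structs rather than the original arrays of structs:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}

// Hypothetical helper: group flattened column names by their first "_"
// segment and recurse on the remainder, emitting a struct() per level.
def nestColumns(names: Seq[String], prefix: String = ""): Seq[Column] =
  names.groupBy(_.split("_", 2)(0)).toSeq.map { case (head, group) =>
    val children = group.filter(_.contains("_")).map(_.split("_", 2)(1))
    if (children.isEmpty) col("`" + prefix + head + "`").as(head)
    else struct(nestColumns(children, prefix + head + "_"): _*).as(head)
  }

val dfNested: DataFrame =
  flattendedJSON.select(nestColumns(flattendedJSON.columns): _*)
```

Because select replaces all columns at once, there is no need for the separate drop pass from the question, and intermediate names like header_userAgent never have to resolve against the input columns.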