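Flattening nested JSON into Spark DataFrames

Time: 2020-10-07 04:30:50

Tags: dataframe apache-spark apache-spark-sql

I am trying to create a DataFrame from a nested JSON string and split it into multiple DataFrames: the outer element data should go to one DataFrame and the nested child data to another. There may be multiple nested elements. I have looked at other posts, and none of them provides a working example for the case below. In this example the number of states is dynamic, and I want to store the country information and the state information in two separate HDFS folders. The parent DataFrame should therefore hold rows like the ones shown below.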

val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""

val countryDf = spark.read.json(Seq(jsonStr).toDS)

countryDf.show(false)

+---+-------+--------------+
|ISD|country|states        |
+---+-------+--------------+
|001|US     |[[NJ, NY, PA]]|
+---+-------+--------------+

countryDf.withColumn("states",explode($"states")).show(false)



val statesDf = countryDf.select(explode(countryDf("states").as("states")))
statesDf.show(false)
+------------+
|col         |
+------------+
|[NJ, NY, PA]|
+------------+

Expected output: two DataFrames

countryDf
+---+-------+
|ISD|country|
+---+-------+
|001|US     |
+---+-------+

statesDf 

+-------+------+------+------+
|country|state1|state2|state3|
+-------+------+------+------+
|US     |NJ    |NY    |PA    |
+-------+------+------+------+

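I have looked at other Stack Overflow questions about flattening nested JSON. None of them has a working solution.

2 Answers:

Answer 0 (score: 1)

Here is some code that does the job. You should consider performance, particularly if the number of columns is large. I collect all the map fields and add them to the DataFrame as columns.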

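import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, explode, lit, map}
import org.apache.spark.sql.types.StructType
import spark.implicits._

val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
val countryDf = spark.read.json(Seq(jsonStr).toDS)
countryDf.show(false)

// One row per element of the states array, keeping country as the key.
val statesDf = countryDf.select($"country", explode($"states").as("states"))

// Read the field names of the exploded struct from the schema.
val index = statesDf.schema.fieldIndex("states")
val stateSchema = statesDf.schema(index).dataType.asInstanceOf[StructType]

// Alternating key/value columns for map(): lit(name), col("states.name").
// The exploded column is named "states", so the prefix must be "states.".
// map() requires all values to share one type (all strings here).
val columns: Seq[Column] = stateSchema.fields.toSeq.flatMap { field =>
  Seq(lit(field.name), col("states." + field.name))
}

val s2 = statesDf.withColumn("statesMap", map(columns: _*))

// Collect the distinct map keys to the driver; sorted for a stable column order.
val allMapKeys = s2.select(explode($"statesMap")).select($"key").distinct.collect().map(_.getString(0)).sorted

// Add one column per key, then drop the helper map (the states struct column is kept).
val s3 = allMapKeys.foldLeft(s2)((df, key) => df.withColumn(key, df("statesMap")(key)))
  .drop("statesMap")
s3.show(false)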
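
The performance caveat in this answer comes largely from building a map column and then calling collect() to discover its keys. As a rough sketch, not the answerer's code, the same field names can be taken from the schema alone, with no map column and no driver-side collect. The object name SchemaDrivenFlatten and the helper flattenStruct are hypothetical, and the sketch assumes the struct field names contain no dots.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.StructType

object SchemaDrivenFlatten {
  // Hypothetical helper (not from the answer): promotes every field of the
  // struct column `structCol` to a top-level column, using only the schema.
  // Assumes `structCol` is a StructType column and field names contain no dots.
  def flattenStruct(df: DataFrame, structCol: String): DataFrame = {
    val struct = df.schema(structCol).dataType.asInstanceOf[StructType]
    val fieldCols = struct.fieldNames.map(name => col(s"$structCol.$name").as(name))
    val otherCols = df.columns.filter(_ != structCol).map(c => col(c))
    df.select((otherCols ++ fieldCols): _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; on a cluster the master is set by spark-submit.
    val spark = SparkSession.builder().master("local[*]").appName("schema-driven-flatten").getOrCreate()
    import spark.implicits._

    val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
    val countryDf = spark.read.json(Seq(jsonStr).toDS)

    // Explode the array first, then flatten the resulting struct column.
    val exploded = countryDf.select($"country", explode($"states").as("states"))
    flattenStruct(exploded, "states").show(false)

    spark.stop()
  }
}
```

Unlike s3 above, this version drops the original states struct column from the result, which matches the expected statesDf in the question.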

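Answer 1 (score: 1)

When you read nested JSON and convert it to a Dataset, the nested part is stored as a struct type (here, an array of structs). So you have to flatten the struct type in the DataFrame.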
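
Because the question mentions that there may be several nested elements, it can help to find the nested columns from the schema instead of naming them by hand. The following is a minimal sketch under that assumption; the object NestedArraySplit and the helpers nestedArrayColumns and splitChildren are hypothetical names, and it only handles top-level columns of type array of struct.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

object NestedArraySplit {
  // Names of the top-level columns whose type is array<struct<...>>.
  def nestedArrayColumns(df: DataFrame): Seq[String] =
    df.schema.fields.toSeq.collect {
      case StructField(name, ArrayType(_: StructType, _), _, _) => name
    }

  // One child DataFrame per nested array column, each carrying `keyCol`
  // so that child rows can be joined back to the parent.
  def splitChildren(df: DataFrame, keyCol: String): Map[String, DataFrame] =
    nestedArrayColumns(df).map { c =>
      val child = df
        .select(col(keyCol), explode(col(c)).as(c)) // one row per array element
        .select(col(keyCol), col(s"$c.*"))          // struct fields become columns
      c -> child
    }.toMap

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("nested-array-split").getOrCreate()
    import spark.implicits._

    val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
    val countryDf = spark.read.json(Seq(jsonStr).toDS)

    countryDf.printSchema() // shows that states is an array of structs
    splitChildren(countryDf, "country").foreach { case (name, child) =>
      println(s"child DataFrame for column: $name")
      child.show(false)
    }

    spark.stop()
  }
}
```

Structs nested more deeply than one level would need the same steps applied again to each child DataFrame.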

import org.apache.spark.sql.functions.explode
import spark.implicits._

val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
val countryDf = spark.read.json(Seq(jsonStr).toDS)

countryDf.show(false)
+---+-------+--------------+
|ISD|country|states        |
+---+-------+--------------+
|001|US     |[[NJ, NY, PA]]|
+---+-------+--------------+

val countryDfExploded = countryDf.withColumn("states",explode($"states"))
countryDfExploded.show(false)
+---+-------+------------+
|ISD|country|states      |
+---+-------+------------+
|001|US     |[NJ, NY, PA]|
+---+-------+------------+

val countrySelectDf = countryDfExploded.select($"ISD", $"country")
countrySelectDf.show(false)
+---+-------+
|ISD|country|
+---+-------+
|001|US     |
+---+-------+

val statesDf = countryDfExploded.select( $"country",$"states.*")
statesDf.show(false)
+-------+------+------+------+
|country|state1|state2|state3|
+-------+------+------+------+
|US     |NJ    |NY    |PA    |
+-------+------+------+------+
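
The question also asks for the two DataFrames to be stored in separate HDFS folders, which neither answer shows. A possible sketch of that last step follows. The object WriteSplitFrames, the helper writeSplit, the folder names country and states, and the choice of Parquet are all assumptions made for illustration. The demo writes to a local temporary directory; on a cluster the output root would be an HDFS location of your choosing.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object WriteSplitFrames {
  // Hypothetical helper: writes parent and child data under two sub-folders of `outRoot`.
  // Folder names and the Parquet format are assumptions, not taken from the thread.
  def writeSplit(countryDf: DataFrame, outRoot: String): Unit = {
    // Parent rows are selected before exploding, so a country with several
    // (or zero) state entries still appears exactly once.
    val parentDf = countryDf.select(col("ISD"), col("country"))

    // Child rows: one per element of the states array, keyed by country.
    val childDf = countryDf
      .select(col("country"), explode(col("states")).as("states"))
      .select(col("country"), col("states.*"))

    parentDf.write.mode("overwrite").parquet(s"$outRoot/country")
    childDf.write.mode("overwrite").parquet(s"$outRoot/states")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("write-split-frames").getOrCreate()
    import spark.implicits._

    // Output root: first argument if given, otherwise a local temp directory.
    val outRoot = if (args.nonEmpty) args(0) else Files.createTempDirectory("country-states").toString

    val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
    val countryDf = spark.read.json(Seq(jsonStr).toDS)

    writeSplit(countryDf, outRoot)
    spark.read.parquet(s"$outRoot/states").show(false)

    spark.stop()
  }
}
```

Note that explode drops parent rows whose array is empty or null; explode_outer keeps them with null child fields, if that is the behavior you need.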