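I am trying to create a DataFrame from a nested JSON string and split it into multiple DataFrames: the outer element data should go to one DataFrame and the nested child data to another. There may be multiple nested elements. I looked at other posts, and none of them gives a working example for the case below. In this example the number of states is dynamic, and I want to store the country information and the state information in two separate HDFS folders. The parent DataFrame would therefore hold rows like the ones shown below.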
val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
val countryDf = spark.read.json(Seq(jsonStr).toDS)
countryDf.show(false)
+---+-------+--------------+
|ISD|country|states |
+---+-------+--------------+
|001|US |[[NJ, NY, PA]]|
+---+-------+--------------+
countryDf.withColumn("states",explode($"states")).show(false)
val statesDf = countryDf.select(explode(countryDf("states").as("states")))
statesDf.show(false)
+------------+
|col |
+------------+
|[NJ, NY, PA]|
+------------+
Expected output:
Two DataFrames
countryDf
+---+-------+
|ISD|country|
+---+-------+
|001|US |
+---+-------+
statesDf
+-------+------+------+------+
|country|state1|state2|state3|
+-------+------+------+------+
|US |NJ |NY |PA |
+-------+------+------+------+
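I looked at other Stack Overflow questions about flattening nested JSON. None of them has a working solution.
Answer 0 (score: 1)
Here is some code that does the job. Keep performance in mind if the number of columns is large. I collect all the map fields and add them to the DataFrame as columns.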
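import scala.collection.mutable
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, explode, lit, map}
import org.apache.spark.sql.types.StructType
import spark.implicits._

val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
val countryDf = spark.read.json(Seq(jsonStr).toDS)
countryDf.show(false)
// One row per element of the states array, keeping country as the key.
val statesDf = countryDf.select($"country", explode($"states").as("states"))
// Read the field names of the exploded struct from the schema.
val index = statesDf.schema.fieldIndex("states")
val stateSchema = statesDf.schema(index).dataType.asInstanceOf[StructType]
// Build alternating key/value columns for map(): the field name, then its value.
val columns = mutable.LinkedHashSet[Column]()
stateSchema.fields.foreach { field =>
  columns.add(lit(field.name))
  // The exploded column is named "states", so the struct fields live under "states.<name>".
  columns.add(col("states." + field.name))
}
val s2 = statesDf.withColumn("statesMap", map(columns.toSeq: _*))
// Collect the distinct map keys to the driver, then add one column per key.
val allMapKeys = s2.select(explode($"statesMap")).select($"key").distinct.collect().map(_.get(0).toString)
val s3 = allMapKeys.foldLeft(s2)((a, b) => a.withColumn(b, a("statesMap")(b)))
  .drop("statesMap")
s3.show(false)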
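Answer 1 (score: 1)
When you read nested JSON and convert it to a Dataset, the nested part is stored as a struct type. So you have to flatten the struct types in the DataFrame.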
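To see which columns are nested before flattening anything, the inferred schema can be inspected. The following is only a minimal sketch: the object name `NestedColumns` and the helper `arrayOfStructColumns` are made up for illustration, and it assumes a local SparkSession and that the nested data always arrives as an array of structs, as in the question's sample.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

object NestedColumns {
  // Hypothetical helper: names of all top-level columns whose type is array<struct<...>>.
  def arrayOfStructColumns(df: DataFrame): Seq[String] =
    df.schema.fields.toSeq.collect {
      case StructField(name, ArrayType(_: StructType, _), _, _) => name
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("nested-columns").getOrCreate()
    import spark.implicits._

    val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
    val countryDf = spark.read.json(Seq(jsonStr).toDS)

    countryDf.printSchema()
    println(arrayOfStructColumns(countryDf).mkString(", "))
    spark.stop()
  }
}
```

For the sample JSON this should report `states` as the only nested column, which is the one exploded in the steps that follow.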
val jsonStr="""{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
val countryDf = spark.read.json(Seq(jsonStr).toDS)
countryDf.show(false)
+---+-------+--------------+
|ISD|country|states |
+---+-------+--------------+
|001|US |[[NJ, NY, PA]]|
+---+-------+--------------+
val countryDfExploded = countryDf.withColumn("states",explode($"states"))
countryDfExploded.show(false)
+---+-------+------------+
|ISD|country|states |
+---+-------+------------+
|001|US |[NJ, NY, PA]|
+---+-------+------------+
val countrySelectDf = countryDfExploded.select($"ISD", $"country")
countrySelectDf.show(false)
+---+-------+
|ISD|country|
+---+-------+
|001|US |
+---+-------+
val statesDf = countryDfExploded.select( $"country",$"states.*")
statesDf.show(false)
+-------+------+------+------+
|country|state1|state2|state3|
+-------+------+------+------+
|US |NJ |NY |PA |
+-------+------+------+------+
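Since the question asks for the two results to end up in separate HDFS folders, the steps above could be wrapped up roughly as follows. This is a sketch under assumptions, not a definitive implementation: the object `SplitAndWrite`, the helper `splitAndWrite`, the folder layout (`<baseDir>/parent` and `<baseDir>/<nestedCol>`), and the choice of Parquet as the output format are all illustrative and not taken from the original posts.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object SplitAndWrite {
  // Hypothetical helper: writes the parent rows and the flattened child rows
  // to two sibling folders under baseDir (the layout is an assumption).
  def splitAndWrite(df: DataFrame, keyCol: String, nestedCol: String, baseDir: String): Unit = {
    // Parent: every column except the nested one, taken from the unexploded DataFrame.
    val parentDf = df.drop(nestedCol)
    // Child: one row per array element, with the struct fields promoted to columns.
    val childDf = df
      .select(col(keyCol), explode(col(nestedCol)).as(nestedCol))
      .select(col(keyCol), col(s"$nestedCol.*"))

    parentDf.write.mode(SaveMode.Overwrite).parquet(s"$baseDir/parent")
    childDf.write.mode(SaveMode.Overwrite).parquet(s"$baseDir/$nestedCol")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("split-and-write").getOrCreate()
    import spark.implicits._

    val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
    val countryDf = spark.read.json(Seq(jsonStr).toDS)

    // Pass an HDFS URI as the first argument; a local temp directory is used otherwise.
    val baseDir = args.headOption.getOrElse(Files.createTempDirectory("split-demo").toString)
    splitAndWrite(countryDf, "country", "states", baseDir)
    spark.stop()
  }
}
```

One design choice here differs from the steps above: the parent rows are taken from the unexploded DataFrame, so a country whose `states` array has several elements still appears once in the parent output. If there are several nested columns, the helper can be called once per column.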