This is the structure of my input DataFrame:
root
|--Name (String)
|--Version (int)
|--Details (array)
The data looks something like this:
"Name":"json",
"Version":1,
"Details":[
"{
\"Id\":\"123\",
\"TaxDetails\":[\"TaxDetail1\":\"val1\", \"TaxDetail2\":\"val2\"]
}",
"{
\"Id\":\"234\",
\"TaxDetails\":[\"TaxDetail3\":\"val3\", \"TaxDetail4\":\"val4\"]
}"
]
I want to explode it at the TaxDetails level:
"Name":"json",
"Version":1,
"TaxDetail":{\"TaxDetail1\":\"val1\"}
"Name":"json",
"Version":1,
"TaxDetail":{\"TaxDetail2\":\"val2\"}
"Name":"json",
"Version":1,
"TaxDetail":{\"TaxDetail3\":\"val3\"}
"Name":"json",
"Version":1,
"TaxDetail":{\"TaxDetail4\":\"val4\"}
I have already exploded Details with the explode function:
val explodedDetailDf = inputDf.withColumn("Detail", explode($"Details"))
Now the Detail column has the data type string, and when I try this:
val explodedTaxDetail = explodedDetailDf.withColumn("TaxDetail", explode($"Detail.TaxDetails"))
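The above fails with an AnalysisException caused by a data type mismatch: "input to function explode should be array or map type, not string".
How can I explode a nested JSON array by its name?
Answer 0 (score: 3)
explode accepts values of map or array type, but not strings.
In the sample JSON, Detail.TaxDetails is of string type, not an array.
To extract values from the Detail.TaxDetails string, you have to use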
def from_json(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.types.StructType): org.apache.spark.sql.Column
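As a minimal sketch of how that call fits in, assuming the Detail column holds JSON strings shaped like the sample: the string is first parsed into a struct with from_json, and only then can its array be exploded. The schema here is written by hand, and the names FromJsonSketch, detailSchema and parseAndExplode are mine, not from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

object FromJsonSketch {
  // Hand-written schema for one Details element (an assumption about the data's shape;
  // the field names are taken from the first sample element).
  val detailSchema: StructType = new StructType()
    .add("Id", StringType)
    .add("TaxDetails", ArrayType(
      new StructType()
        .add("TaxDetail1", StringType)
        .add("TaxDetail2", StringType)))

  // Parse the string column Detail into a struct, then explode its TaxDetails array.
  def parseAndExplode(df: DataFrame): DataFrame =
    df.withColumn("Detail", from_json(col("Detail"), detailSchema))
      .withColumn("TaxDetail", explode(col("Detail.TaxDetails")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("from-json-sketch").getOrCreate()
    import spark.implicits._

    // One JSON string, standing in for a single exploded element of Details.
    val raw = Seq("""{"Id":"123","TaxDetails":[{"TaxDetail1":"val1","TaxDetail2":"val2"}]}""").toDF("Detail")
    parseAndExplode(raw).select("Detail.Id", "TaxDetail.*").show(false)
    spark.stop()
  }
}
```

Without the from_json step, Detail stays a plain string and explode rejects it with the error quoted in the question.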
Note
Your JSON is malformed, so I modified it as shown below.
scala> val json = """{
| "Name": "json",
| "Version": 1,
| "Details": [
| "{\"Id\":\"123\",\"TaxDetails\":[{\"TaxDetail1\":\"val1\", \"TaxDetail2\":\"val2\"}]}",
| "{\"Id\":\"234\",\"TaxDetails\":[{\"TaxDetail3\":\"val3\", \"TaxDetail4\":\"val4\"}]}"
| ]
| }"""
json: String =
{
"Name": "json",
"Version": 1,
"Details": [
"{\"Id\":\"123\",\"TaxDetails\":[{\"TaxDetail1\":\"val1\", \"TaxDetail2\":\"val2\"}]}",
"{\"Id\":\"234\",\"TaxDetails\":[{\"TaxDetail3\":\"val3\", \"TaxDetail4\":\"val4\"}]}"
]
}
The following code shows how to extract the values of Detail.TaxDetails:
scala> val df = spark.read.json(Seq(json).toDS)
df: org.apache.spark.sql.DataFrame = [Details: array<string>, Name: string ... 1 more field]
scala> df.printSchema
root
|-- Details: array (nullable = true)
| |-- element: string (containsNull = true)
|-- Name: string (nullable = true)
|-- Version: long (nullable = true)
scala> df.withColumn("details",explode($"details").as("details")).show(false) // the elements of the details array are strings
+----------------------------------------------------------------------+----+-------+
|details |Name|Version|
+----------------------------------------------------------------------+----+-------+
|{"Id":"123","TaxDetails":[{"TaxDetail1":"val1", "TaxDetail2":"val2"}]}|json|1 |
|{"Id":"234","TaxDetails":[{"TaxDetail3":"val3", "TaxDetail4":"val4"}]}|json|1 |
+----------------------------------------------------------------------+----+-------+
scala> val json = spark.read.json(Seq("""[{"Id": "123","TaxDetails": [{"TaxDetail1": "val1","TaxDetail2": "val2"}]},{"Id": "234","TaxDetails": [{"TaxDetail3": "val3","TaxDetail4": "val4"}]}]""").toDS).schema.json
json: String = {"type":"struct","fields":[{"name":"Id","type":"string","nullable":true,"metadata":{}},{"name":"TaxDetails","type":{"type":"array","elementType":{"type":"struct","fields":[{"name":"TaxDetail1","type":"string","nullable":true,"metadata":{}},{"name":"TaxDetail2","type":"string","nullable":true,"metadata":{}},{"name":"TaxDetail3","type":"string","nullable":true,"metadata":{}},{"name":"TaxDetail4","type":"string","nullable":true,"metadata":{}}]},"containsNull":true},"nullable":true,"metadata":{}}]}
scala> import org.apache.spark.sql.types.{DataType, StructType}
scala> val schema = DataType.fromJson(json).asInstanceOf[StructType] // build the schema for the inner JSON string
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Id,StringType,true), StructField(TaxDetails,ArrayType(StructType(StructField(TaxDetail1,StringType,true), StructField(TaxDetail2,StringType,true), StructField(TaxDetail3,StringType,true), StructField(TaxDetail4,StringType,true)),true),true))
scala> spark.time(df.withColumn("details",explode($"details")).withColumn("details",from_json($"details",schema)).withColumn("id",$"details.id").withColumn("taxdetails",explode($"details.taxdetails")).select($"name",$"version",$"id",$"taxdetails.*").show(false))
+----+-------+---+----------+----------+----------+----------+
|name|version|id |TaxDetail1|TaxDetail2|TaxDetail3|TaxDetail4|
+----+-------+---+----------+----------+----------+----------+
|json|1 |123|val1 |val2 |null |null |
|json|1 |234|null |null |val3 |val4 |
+----+-------+---+----------+----------+----------+----------+
Updated
Above, I created the schema from a JSON string that I typed in by hand. The code below derives the schema from the available data instead.
scala> spark.read.json(df.withColumn("details",explode($"details").as("details")).select("details").map(_.getAs[String](0))).printSchema
root
|-- Id: string (nullable = true)
|-- TaxDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- TaxDetail1: string (nullable = true)
| | |-- TaxDetail2: string (nullable = true)
| | |-- TaxDetail3: string (nullable = true)
| | |-- TaxDetail4: string (nullable = true)
scala> spark.read.json(df.withColumn("details",explode($"details").as("details")).select("details").map(_.getAs[String](0))).schema
res12: org.apache.spark.sql.types.StructType = StructType(StructField(Id,StringType,true), StructField(TaxDetails,ArrayType(StructType(StructField(TaxDetail1,StringType,true), StructField(TaxDetail2,StringType,true), StructField(TaxDetail3,StringType,true), StructField(TaxDetail4,StringType,true)),true),true))
scala> val schema = spark.read.json(df.withColumn("details",explode($"details").as("details")).select("details").map(_.getAs[String](0))).schema
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Id,StringType,true), StructField(TaxDetails,ArrayType(StructType(StructField(TaxDetail1,StringType,true), StructField(TaxDetail2,StringType,true), StructField(TaxDetail3,StringType,true), StructField(TaxDetail4,StringType,true)),true),true))
scala> spark.time(df.withColumn("details",explode($"details")).withColumn("details",from_json($"details",schema)).withColumn("id",$"details.id").withColumn("taxdetails",explode($"details.taxdetails")).select($"name",$"version",$"id",$"taxdetails.*").show(false))
+----+-------+---+----------+----------+----------+----------+
|name|version|id |TaxDetail1|TaxDetail2|TaxDetail3|TaxDetail4|
+----+-------+---+----------+----------+----------+----------+
|json|1 |123|val1 |val2 |null |null |
|json|1 |234|null |null |val3 |val4 |
+----+-------+---+----------+----------+----------+----------+
Time taken: 212 ms
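The result above has one row per Details element, with a column for every tax detail key. If one row per key is wanted, as in the question's expected output, a possible variation is to declare TaxDetails as an array of maps and explode the map as well. This is only a sketch under that assumption: the schema is hand-written, the object and method names are hypothetical, and the key and value column names are what explode produces by default for a map.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, MapType, StringType, StructType}

object TaxDetailRows {
  // Assumed schema: TaxDetails is an array of string-to-string maps,
  // so the individual key names do not have to be known in advance.
  val detailSchema: StructType = new StructType()
    .add("Id", StringType)
    .add("TaxDetails", ArrayType(MapType(StringType, StringType)))

  // Expects columns Name, Version and Details (array of JSON strings).
  def perTaxDetail(df: DataFrame): DataFrame =
    df.select(col("Name"), col("Version"), explode(col("Details")).as("Detail"))
      .withColumn("Detail", from_json(col("Detail"), detailSchema))
      // one row per element of the TaxDetails array
      .select(col("Name"), col("Version"), explode(col("Detail.TaxDetails")).as("TaxDetail"))
      // one row per map entry; explode on a map yields the columns key and value
      .select(col("Name"), col("Version"), explode(col("TaxDetail")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("tax-detail-rows").getOrCreate()
    import spark.implicits._

    val input = Seq(("json", 1, Seq(
      """{"Id":"123","TaxDetails":[{"TaxDetail1":"val1","TaxDetail2":"val2"}]}""",
      """{"Id":"234","TaxDetails":[{"TaxDetail3":"val3","TaxDetail4":"val4"}]}"""
    ))).toDF("Name", "Version", "Details")

    perTaxDetail(input).show(false)
    spark.stop()
  }
}
```

The map-based schema trades typed columns for flexibility: every value is read as a string, and no null columns appear for keys that a given element does not have.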
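Answer 1 (score: 1)
Since the JSON you provided earlier was malformed, I reformatted it as shown below so that explode can be applied twice to flatten the DataFrame.
The implementation is as follows...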
package examples
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object JsonTest extends App {
Logger.getLogger("org").setLevel(Level.OFF)
private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
val jsonString =
"""
|{
| "Name": "json",
| "Version": "1",
| "Details": [
| {
| "Id": "123",
| "TaxDetails": [
| {
| "TaxDetail1": "val1",
| "TaxDetail2": "val2"
| }
| ]
| },
| {
| "Id":"234",
| "TaxDetails":[
| {
| "TaxDetail3":"val3"
| , "TaxDetail4":"val4"
| }
| ]
|}
| ]
|}
""".stripMargin
val df3 = spark.read.json(Seq(jsonString).toDS)
df3.printSchema()
df3.show(false)
val explodedDetailDf = df3.withColumn("Detail", explode($"Details"))
// explodedDetailDf.show(false)
val explodedTaxDetail = explodedDetailDf.withColumn("TaxDetail", explode($"Detail.TaxDetails"))
explodedTaxDetail.show(false)
val finaldf = explodedTaxDetail.select($"Name", $"Version"
, to_json(struct
(col("TaxDetail.TaxDetail1").as("TaxDetail1"))
).as("TaxDetails"))
.union(
explodedTaxDetail.select($"Name", $"Version"
, to_json(struct
(col("TaxDetail.TaxDetail2").as("TaxDetail2"))
).as("TaxDetails"))
)
.union(
explodedTaxDetail.select($"Name", $"Version"
, to_json(struct
(col("TaxDetail.TaxDetail3").as("TaxDetail3"))
).as("TaxDetails"))
)
.union(
explodedTaxDetail.select($"Name", $"Version"
, to_json(struct
(col("TaxDetail.TaxDetail4").as("TaxDetail4"))
).as("TaxDetails"))
).filter(!($"TaxDetails" === "{}"))
finaldf.show(false)
finaldf.toJSON.show(false)
}
Result:
root
|-- Details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Id: string (nullable = true)
| | |-- TaxDetails: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- TaxDetail1: string (nullable = true)
| | | | |-- TaxDetail2: string (nullable = true)
| | | | |-- TaxDetail3: string (nullable = true)
| | | | |-- TaxDetail4: string (nullable = true)
|-- Name: string (nullable = true)
|-- Version: string (nullable = true)
+---------------------------------------------------+----+-------+
|Details |Name|Version|
+---------------------------------------------------+----+-------+
|[[123, [[val1, val2,,]]], [234, [[,, val3, val4]]]]|json|1 |
+---------------------------------------------------+----+-------+
+---------------------------------------------------+----+-------+------------------------+---------------+
|Details |Name|Version|Detail |TaxDetail |
+---------------------------------------------------+----+-------+------------------------+---------------+
|[[123, [[val1, val2,,]]], [234, [[,, val3, val4]]]]|json|1 |[123, [[val1, val2,,]]] |[val1, val2,,] |
|[[123, [[val1, val2,,]]], [234, [[,, val3, val4]]]]|json|1 |[234, [[,, val3, val4]]]|[,, val3, val4]|
+---------------------------------------------------+----+-------+------------------------+---------------+
+----+-------+---------------------+
|Name|Version|TaxDetails |
+----+-------+---------------------+
|json|1 |{"TaxDetail1":"val1"}|
|json|1 |{"TaxDetail2":"val2"}|
|json|1 |{"TaxDetail3":"val3"}|
|json|1 |{"TaxDetail4":"val4"}|
+----+-------+---------------------+
The final output you expect:
+----------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------+
|{"Name":"json","Version":"1","TaxDetails":"{\"TaxDetail1\":\"val1\"}"}|
|{"Name":"json","Version":"1","TaxDetails":"{\"TaxDetail2\":\"val2\"}"}|
|{"Name":"json","Version":"1","TaxDetails":"{\"TaxDetail3\":\"val3\"}"}|
|{"Name":"json","Version":"1","TaxDetails":"{\"TaxDetail4\":\"val4\"}"}|
+----------------------------------------------------------------------+
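The four union branches above differ only in the field name, so they could be generated from the schema instead of being written out. A rough sketch, assuming the DataFrame has already been exploded into a TaxDetail struct column as in the code above; UnionPerField and onePerField are hypothetical names of mine.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, struct, to_json}
import org.apache.spark.sql.types.StructType

object UnionPerField {
  // Builds one select per field of the TaxDetail struct and unions them.
  // Assumes TaxDetail is a struct column with at least one field
  // (reduce throws on an empty collection).
  def onePerField(exploded: DataFrame): DataFrame = {
    val fields = exploded.schema("TaxDetail").dataType.asInstanceOf[StructType].fieldNames
    fields
      .map { f =>
        exploded.select(col("Name"), col("Version"),
          to_json(struct(col(s"TaxDetail.$f").as(f))).as("TaxDetails"))
      }
      .reduce(_ union _)
      // a null field serializes to "{}", so drop those rows as the original code does
      .filter(col("TaxDetails") =!= "{}")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("union-per-field").getOrCreate()
    import spark.implicits._

    val json = """{"Name":"json","Version":"1","Details":[{"Id":"123","TaxDetails":[{"TaxDetail1":"val1","TaxDetail2":"val2"}]},{"Id":"234","TaxDetails":[{"TaxDetail3":"val3","TaxDetail4":"val4"}]}]}"""
    val exploded = spark.read.json(Seq(json).toDS)
      .withColumn("Detail", explode($"Details"))
      .withColumn("TaxDetail", explode($"Detail.TaxDetails"))

    onePerField(exploded).show(false)
    spark.stop()
  }
}
```

This keeps the behavior of the hand-written unions but follows the schema, so a fifth key in the data would not need a fifth branch.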