Exploding a JSON array column

Date: 2020-07-06 10:09:07

Tags: scala apache-spark apache-spark-sql

I am struggling to explode a JSON array column with Spark.

I have a dataframe that looks like this:

+------+------------------------------------------------------------------------+
|id    |struct                                                                  |
+------+------------------------------------------------------------------------+
|  1   |  [{_name: BankAccount, _value: 123456}, {_name: Balance, _value: 500$}]|
|  2   |  [{_name: BankAccount, _value: 098765}, {_name: Balance, _value: 100$}]|
|  3   |  [{_name: BankAccount, _value: 135790}, {_name: Balance, _value: 200$}]|
+------+------------------------------------------------------------------------+

I would like it to look like this:

+------+------------+--------+
|id    | BankAccount| Balance|
+------+------------+--------+
|  1   |   123456   | 500$   |
|  2   |   098765   | 100$   |
|  3   |   135790   | 200$   |
+------+------------+--------+

It does not literally have to be exploded, of course, but I am still far from the result I need.

Thanks for your help!

1 Answer:

Answer 0 (score: 1)

Check the code below.

To keep things simple, I named the sample data column data instead of struct. :)

 val df = Seq(
   (1, """[{"_name":"BankAccount","_value":"123456"},{"_name":"Balance","_value": "500$"}]"""),
   (2, """[{"_name":"BankAccount","_value":"098765"},{"_name":"Balance","_value": "100$"}]"""),
   (3, """[{"_name":"BankAccount","_value":"135790"},{"_name":"Balance","_value": "200$"}]""")
 ).toDF("id", "data")

Print the schema of the data:

scala> df.printSchema
root
 |-- id: integer (nullable = false)
 |-- data: string (nullable = true)

Show the sample data:

scala> df.show(false)
+---+--------------------------------------------------------------------------------+
|id |data                                                                            |
+---+--------------------------------------------------------------------------------+
|1  |[{"_name":"BankAccount","_value":"123456"},{"_name":"Balance","_value": "500$"}]|
|2  |[{"_name":"BankAccount","_value":"098765"},{"_name":"Balance","_value": "100$"}]|
|3  |[{"_name":"BankAccount","_value":"135790"},{"_name":"Balance","_value": "200$"}]|
+---+--------------------------------------------------------------------------------+

Create a schema for the JSON data:

scala> import org.apache.spark.sql.types._
scala> val schema = ArrayType(MapType(StringType, StringType))
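For intuition: with the `ArrayType(MapType(StringType, StringType))` schema, each parsed row becomes an array of string-to-string maps. A minimal plain-Scala sketch of that shape (no Spark, illustrative data only):

```scala
// Illustrative only: the shape of one parsed row after from_json
// with schema ArrayType(MapType(StringType, StringType)).
val parsedRow: Seq[Map[String, String]] = Seq(
  Map("_name" -> "BankAccount", "_value" -> "123456"),
  Map("_name" -> "Balance",     "_value" -> "500$")
)

// Each map element exposes the attribute name and its value,
// which is what $"data"("_name") and $"data"("_value") read later.
val pairs = parsedRow.map(m => m("_name") -> m("_value"))
println(pairs) // List((BankAccount,123456), (Balance,500$))
```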

Use explode, groupBy, and pivot to get the expected result.

Note: you may need to fine-tune the code below to your needs.

scala> import org.apache.spark.sql.functions._

scala>

df
.withColumn("data",explode(from_json($"data",schema)))
.select($"id",struct($"data"("_name").as("key"),$"data"("_value").as("value")).as("data"))
.select($"id",$"data.*")
.groupBy($"id")
.pivot($"key")
.agg(first($"value"))
.select("id","BankAccount","Balance")
.orderBy($"id".asc)
.show(false)
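The explode → groupBy → pivot chain above can be mimicked on plain Scala collections, which may help when reasoning about what each step does. This is a hypothetical in-memory sketch, not Spark code:

```scala
// (id, attribute name, attribute value) triples, i.e. the frame
// as it looks after the explode and select steps.
val exploded = Seq(
  (1, "BankAccount", "123456"), (1, "Balance", "500$"),
  (2, "BankAccount", "098765"), (2, "Balance", "100$"),
  (3, "BankAccount", "135790"), (3, "Balance", "200$")
)

// groupBy + pivot: one row per id, one "column" per attribute name.
val pivoted: Map[Int, Map[String, String]] =
  exploded
    .groupBy(_._1)
    .map { case (id, rows) =>
      id -> rows.map { case (_, name, value) => name -> value }.toMap
    }

println(pivoted(1)("BankAccount")) // 123456
println(pivoted(3)("Balance"))     // 200$
```

Spark's `pivot` additionally needs an aggregate (`first($"value")` here) because, unlike this sketch, it cannot assume a single value per (id, key) pair.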

Final result:

+---+-----------+-------+
|id |BankAccount|Balance|
+---+-----------+-------+
|1  |123456     |500$   |
|2  |098765     |100$   |
|3  |135790     |200$   |
+---+-----------+-------+