I'm trying to expand a JSON array column with Spark.
I have a DataFrame that looks like this:
+------+------------------------------------------------------------------------+
|id |struct |
+------+------------------------------------------------------------------------+
| 1 | [{_name: BankAccount, _value: 123456}, {_name: Balance, _value: 500$}]|
| 2 | [{_name: BankAccount, _value: 098765}, {_name: Balance, _value: 100$}]|
| 3 | [{_name: BankAccount, _value: 135790}, {_name: Balance, _value: 200$}]|
+------+------------------------------------------------------------------------+
I would like it to look like this:
+------+------------+--------+
|id | BankAccount| Balance|
+------+------------+--------+
| 1 | 123456 | 500$ |
| 2 | 098765 | 100$ |
| 3 | 135790 | 200$ |
+------+------------+--------+
Granted, this isn't really an explode, but I'm still nowhere near the result I need.
Thanks for your help!
Answer 0: (score: 1)
Check the code below.
For simplicity, I started with sample data in a column named data instead of struct. :)
val df = Seq((1,"""[{"_name":"BankAccount","_value":"123456"},{"_name":"Balance","_value": "500$"}]"""),(2,"""[{"_name":"BankAccount","_value":"098765"},{"_name":"Balance","_value": "100$"}]"""),(3,"""[{"_name":"BankAccount","_value":"135790"},{"_name":"Balance","_value": "200$"}]""")).toDF("id","data")
Print the data schema
scala> df.printSchema
root
|-- id: integer (nullable = false)
|-- data: string (nullable = true)
Show the sample data
scala> df.show(false)
+---+--------------------------------------------------------------------------------+
|id |data |
+---+--------------------------------------------------------------------------------+
|1 |[{"_name":"BankAccount","_value":"123456"},{"_name":"Balance","_value": "500$"}]|
|2 |[{"_name":"BankAccount","_value":"098765"},{"_name":"Balance","_value": "100$"}]|
|3 |[{"_name":"BankAccount","_value":"135790"},{"_name":"Balance","_value": "200$"}]|
+---+--------------------------------------------------------------------------------+
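Create a schema for the JSON data (the type classes come from org.apache.spark.sql.types, which spark-shell does not import by default)
scala> import org.apache.spark.sql.types._
scala> val schema = ArrayType(MapType(StringType,StringType))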
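As an alternative to the map-based schema, the same JSON can be parsed with an explicit struct schema, which yields an array of structs like the column shown in the question. The sketch below is only an illustration: the object name ParseWithStructSchema, the helper parse, and the local SparkSession setup are assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

object ParseWithStructSchema {
  // Each array element is an object with two string fields, _name and _value.
  val entrySchema: ArrayType = ArrayType(
    StructType(Seq(
      StructField("_name", StringType),
      StructField("_value", StringType)
    ))
  )

  // Hypothetical helper: replaces the JSON string column "data" with its parsed form.
  def parse(df: DataFrame): DataFrame =
    df.withColumn("data", from_json(col("data"), entrySchema))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("parse-json-column").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, """[{"_name":"BankAccount","_value":"123456"},{"_name":"Balance","_value":"500$"}]""")
    ).toDF("id", "data")

    // The data column is now array<struct<_name:string,_value:string>>.
    parse(df).printSchema()
    spark.stop()
  }
}
```

With a struct schema the fields can be accessed as data._name and data._value after the explode, instead of looking them up as map keys.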
Use explode, groupBy, and pivot to get the expected result.
Note: you may need to adjust the code below to fit your needs.
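scala>
df
.withColumn("data",explode(from_json($"data",schema))) // parse the JSON string, then emit one row per array element
.select($"id",struct($"data"("_name").as("key"),$"data"("_value").as("value")).as("data")) // pull the name/value pair out of each map
.select($"id",$"data.*") // flatten to the columns id, key, value
.groupBy($"id")
.pivot($"key") // one output column per distinct key
.agg(first($"value")) // each (id, key) pair carries a single value
.select("id","BankAccount","Balance")
.orderBy($"id".asc)
.show(false)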
Final result
+---+-----------+-------+
|id |BankAccount|Balance|
+---+-----------+-------+
|1 |123456 |500$ |
|2 |098765 |100$ |
|3 |135790 |200$ |
+---+-----------+-------+
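If the column is already an array of structs, as the table in the question suggests, the JSON parsing step is unnecessary. One possible alternative, assuming Spark 2.4 or later and that the key names are known in advance, is to turn the array into a map with map_from_entries and select the keys directly, which avoids the groupBy and pivot. The object name WidenStructArray, the Entry case class, and the helper widen are hypothetical names introduced for this sketch.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, map_from_entries}

object WidenStructArray {
  // Hypothetical case class mirroring the element shape shown in the question.
  case class Entry(_name: String, _value: String)

  // Converts array<struct<_name,_value>> into a map, then reads the known keys.
  // Assumes keys are non-null and unique within each row.
  def widen(df: DataFrame): DataFrame =
    df.withColumn("kv", map_from_entries(col("struct")))
      .select(
        col("id"),
        col("kv").getItem("BankAccount").as("BankAccount"),
        col("kv").getItem("Balance").as("Balance")
      )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("widen-struct-array").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, Seq(Entry("BankAccount", "123456"), Entry("Balance", "500$"))),
      (2, Seq(Entry("BankAccount", "098765"), Entry("Balance", "100$"))),
      (3, Seq(Entry("BankAccount", "135790"), Entry("Balance", "200$")))
    ).toDF("id", "struct")

    widen(df).orderBy(col("id").asc).show(false)
    spark.stop()
  }
}
```

The trade-off is that the column names must be listed by hand, whereas pivot discovers them from the data; a key that is missing from a row comes back as null in that row.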