I want to change the structure of a DataFrame in PySpark. The current schema is:
root
|-- roster_id: long (nullable = true)
|-- members: struct (nullable = true)
| |-- m10: struct (nullable = true)
| | |-- name: string (nullable = true)
| | |-- address: string (nullable = true)
| | |-- hobby_1: string (nullable = true)
| | |-- hobby_2: string (nullable = true)
| |-- m15: struct (nullable = true)
| | |-- name: string (nullable = true)
~~~~~~~
What I want:
root
|-- roster_id: long (nullable = true)
|-- member_id: string (nullable = true)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- hobby_1: string (nullable = true)
|-- hobby_2: string (nullable = true)
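However, there are two problems.
・I don't know in advance what the keys "X" in "members.X" (m10, m15, ...) will be.
・A field "members.X.X" (for example hobby_2) may be missing, depending on the member.
I think this is difficult. Is there a way to do it?
Please also tell me if PySpark is not a good fit for this.
Example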
RowData
{
"roster_id": "abc",
"members": {
"m10": {
"name": "John",
"address": "Tokyo",
"hobby_1": "Baseball",
"hobby_2": "Tennis"
},
"m15": {
"name": "Paul",
"address": "NY",
"hobby_1": "Music"
}
}
}
What I want:
+---------+---------+-------+-------+--------+-------+
|roster_id|member_id|   name|address| hobby_1|hobby_2|
+---------+---------+-------+-------+--------+-------+
|      abc|      m10|   John|  Tokyo|Baseball| Tennis|
+---------+---------+-------+-------+--------+-------+
|      abc|      m15|   Paul|     NY|   Music|   null|
+---------+---------+-------+-------+--------+-------+
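
To make the target reshaping concrete, here is a minimal plain-Python sketch of the logic (not a PySpark solution). It assumes the set of inner fields is known up front; `flatten_roster` and `FIELDS` are hypothetical names introduced only for this illustration. The member keys (m10, m15, ...) are discovered at run time, and a field a member lacks comes out as None.

```python
# Plain-Python sketch of the desired reshaping; this is NOT PySpark code.
# FIELDS and flatten_roster are hypothetical names used for illustration.

# Assumption: the inner field names are known in advance.
FIELDS = ["name", "address", "hobby_1", "hobby_2"]


def flatten_roster(record):
    """Return one flat dict per member of a single roster record."""
    rows = []
    # The member keys are not known beforehand, so iterate over them.
    for member_id, member in record["members"].items():
        row = {"roster_id": record["roster_id"], "member_id": member_id}
        for field in FIELDS:
            # dict.get returns None when the member lacks this field.
            row[field] = member.get(field)
        rows.append(row)
    return rows


record = {
    "roster_id": "abc",
    "members": {
        "m10": {
            "name": "John",
            "address": "Tokyo",
            "hobby_1": "Baseball",
            "hobby_2": "Tennis",
        },
        "m15": {
            "name": "Paul",
            "address": "NY",
            "hobby_1": "Music",
        },
    },
}

rows = flatten_roster(record)
for row in rows:
    print(row)
```

Each input record yields one output row per member, which is the shape shown in the table above. How best to express the same thing with DataFrame operations is the open question.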