I have the following JSON (stored on the local filesystem at path_json):
[
  {
    "name": "John",
    "email": "john@hisemail.com",
    "gender": "Male",
    "dict_of_columns": [
      {
        "column_name": "hobbie",
        "columns_value": "guitar"
      },
      {
        "column_name": "book",
        "columns_value": "1984"
      }
    ]
  },
  {
    "name": "Mary",
    "email": "mary@heremail.com",
    "gender": "Female",
    "dict_of_columns": [
      {
        "column_name": "language",
        "columns_value": "Python"
      },
      {
        "column_name": "job",
        "columns_value": "analyst"
      }
    ]
  }
]
As you can see, it is a nested JSON. I am reading it with the following command:
df = spark.read.option("multiline", "true").json(path_json)
OK. It gives me the following DataFrame:
+------------------------------------+-------------------+------+----+
|dict_of_columns |email |gender|name|
+------------------------------------+-------------------+------+----+
|[[hobbie, guitar], [book, 1984]] |john@hisemail.com |Male |John|
|[[language, Python], [job, analyst]]|mary@heremail.com |Female|Mary|
+------------------------------------+-------------------+------+----+
I would like to know if there is a way to produce the following DataFrame instead:
+----+-----------------+------+------+-------+--------+----+
|book|email |gender|hobbie|job |language|name|
+----+-----------------+------+------+-------+--------+----+
|1984|john@hisemail.com|Male |guitar|null |null |John|
|null|mary@heremail.com|Female|null |analyst|Python |Mary|
+----+-----------------+------+------+-------+--------+----+
Some comments:

- There are many distinct column_name values (the sample above shows only a few).
- email is unique in each row, so it can be used as a key if a join is needed.
- I already tried this approach: create a master DataFrame with the columns [name, gender, email], plus one additional DataFrame per row of the dictionaries, and join them. It was not successful (the performance was poor).

Thank you very much!