I have a JSON file containing the following data:
{
  "glossary": {
    "title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry": {
          "ID": "SGML",
          "SortAs": "SGML",
          "GlossTerm": "Standard Generalized Markup Language",
          "Acronym": "SGML",
          "Abbrev": "ISO 8879:1986",
          "GlossDef": {
            "para": "A meta-markup language, used to create markup languages such as DocBook.",
            "GlossSeeAlso": [
              "GML",
              "XML"
            ]
          },
          "GlossSee": "markup"
        }
      }
    }
  }
}
I need to read this file in PySpark and traverse all of the elements in the JSON. I need to identify every struct and array column, and create a separate Hive table for each struct and array column.
For example:
glossary would be one table with "title" as a column
GlossEntry would be another table with the columns "ID", "SortAs", "GlossTerm", "Acronym", "Abbrev"
In the future, the data will keep growing as more nested structures are added, so I will have to write generic code that traverses all the JSON elements and identifies all the struct and array columns.
Is there a way to iterate over every element of the nested structure?
Answer 0 (score: 0)
Spark can parse and infer the JSON schema automatically. Once the data is in a Spark DataFrame, you can access its elements by specifying their path within the JSON.
json_df = spark.read.json(filepath)
json_df.printSchema()
Output:
root
|-- glossary: struct (nullable = true)
| |-- GlossDiv: struct (nullable = true)
| | |-- GlossList: struct (nullable = true)
| | | |-- GlossEntry: struct (nullable = true)
| | | | |-- Abbrev: string (nullable = true)
| | | | |-- Acronym: string (nullable = true)
| | | | |-- GlossDef: struct (nullable = true)
| | | | | |-- GlossSeeAlso: array (nullable = true)
| | | | | | |-- element: string (containsNull = true)
| | | | | |-- para: string (nullable = true)
| | | | |-- GlossSee: string (nullable = true)
| | | | |-- GlossTerm: string (nullable = true)
| | | | |-- ID: string (nullable = true)
| | | | |-- SortAs: string (nullable = true)
| | |-- title: string (nullable = true)
| |-- title: string (nullable = true)
Then select the fields you want to extract:
json_df.select("glossary.title").show()
json_df.select("glossary.GlossDiv.GlossList.GlossEntry.*").select("Abbrev","Acronym","ID","SortAs").show()
Extracted output:
+----------------+
| title|
+----------------+
|example glossary|
+----------------+
+-------------+-------+----+------+
| Abbrev|Acronym| ID|SortAs|
+-------------+-------+----+------+
|ISO 8879:1986| SGML|SGML| SGML|
+-------------+-------+----+------+