Iterating over a JSON object

Time: 2018-12-13 13:26:09

Tags: json pyspark

I have a JSON file containing the following data:

{
  "glossary": {
    "title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry": {
          "ID": "SGML",
          "SortAs": "SGML",
          "GlossTerm": "Standard Generalized Markup Language",
          "Acronym": "SGML",
          "Abbrev": "ISO 8879:1986",
          "GlossDef": {
            "para": "A meta-markup language, used to create markup languages such as DocBook.",
            "GlossSeeAlso": [
              "GML",
              "XML"
            ]
          },
          "GlossSee": "markup"
        }
      }
    }
  }
}

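I need to read this file in PySpark and traverse every element in the JSON. I need to identify all struct columns, array columns, and array-of-struct columns, and create a separate Hive table for each struct and array column.

For example:

glossary will be one table with "title" as its column

GlossEntry will be another table with the columns "ID", "SortAs", "GlossTerm", "Acronym", "Abbrev"

The data will keep growing, with more nested structures added over time. So I have to write generic code that traverses all JSON elements and identifies every struct and array column.

Is there a way to iterate over every element in a nested structure?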

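1 answer:

Answer 0 (score: 0):

Spark can parse JSON and infer its schema automatically. Once the data is in a Spark DataFrame, you can access its elements by specifying their path within the JSON.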

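# multiLine=True is needed when one JSON document spans several lines, as in the question
json_df = spark.read.json(filepath, multiLine=True)
json_df.printSchema()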

Output:

root
 |-- glossary: struct (nullable = true)
 |    |-- GlossDiv: struct (nullable = true)
 |    |    |-- GlossList: struct (nullable = true)
 |    |    |    |-- GlossEntry: struct (nullable = true)
 |    |    |    |    |-- Abbrev: string (nullable = true)
 |    |    |    |    |-- Acronym: string (nullable = true)
 |    |    |    |    |-- GlossDef: struct (nullable = true)
 |    |    |    |    |    |-- GlossSeeAlso: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |-- para: string (nullable = true)
 |    |    |    |    |-- GlossSee: string (nullable = true)
 |    |    |    |    |-- GlossTerm: string (nullable = true)
 |    |    |    |    |-- ID: string (nullable = true)
 |    |    |    |    |-- SortAs: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |-- title: string (nullable = true)

Then select the fields you want to extract:

json_df.select("glossary.title").show()
json_df.select("glossary.GlossDiv.GlossList.GlossEntry.*").select("Abbrev","Acronym","ID","SortAs").show()

Extracted output:

+----------------+
|           title|
+----------------+
|example glossary|
+----------------+

+-------------+-------+----+------+
|       Abbrev|Acronym|  ID|SortAs|
+-------------+-------+----+------+
|ISO 8879:1986|   SGML|SGML|  SGML|
+-------------+-------+----+------+
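
The selects above hard-code each path. For the generic traversal the question asks about, one possible approach is to walk the inferred schema recursively instead of the data. The sketch below is an assumption-laden illustration, not part of the original answer: it works on the plain dict form of the schema (the layout that `json_df.schema.jsonValue()` returns), so it runs without a Spark session. The helpers `field`, `struct`, `array` and `walk` are hypothetical names introduced here only to rebuild the example schema and traverse it.

```python
# Hypothetical helpers that rebuild the example schema in the dict layout
# produced by json_df.schema.jsonValue(); with a real DataFrame you would
# call schema = json_df.schema.jsonValue() instead.
def field(name, dtype):
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}

def struct(*fields):
    return {"type": "struct", "fields": list(fields)}

def array(element_type):
    return {"type": "array", "elementType": element_type, "containsNull": True}

schema = struct(
    field("glossary", struct(
        field("GlossDiv", struct(
            field("GlossList", struct(
                field("GlossEntry", struct(
                    field("Abbrev", "string"),
                    field("Acronym", "string"),
                    field("GlossDef", struct(
                        field("GlossSeeAlso", array("string")),
                        field("para", "string"),
                    )),
                    field("GlossSee", "string"),
                    field("GlossTerm", "string"),
                    field("ID", "string"),
                    field("SortAs", "string"),
                )),
            )),
            field("title", "string"),
        )),
        field("title", "string"),
    )),
)

def walk(dtype, path=""):
    """Yield (path, kind, scalar_columns) for every nested struct and array."""
    if not isinstance(dtype, dict):
        return  # scalar types such as "string" are plain strings
    if dtype["type"] == "struct":
        if path:  # skip the unnamed root struct
            scalars = [f["name"] for f in dtype["fields"]
                       if not isinstance(f["type"], dict)]
            yield path, "struct", scalars
        for f in dtype["fields"]:
            child = f"{path}.{f['name']}" if path else f["name"]
            yield from walk(f["type"], child)
    elif dtype["type"] == "array":
        yield path, "array", []
        # An array of structs is reported again as a struct at the same path.
        yield from walk(dtype["elementType"], path)

results = list(walk(schema))
for path, kind, scalars in results:
    print(kind, path, scalars)
```

Each struct entry pairs a dotted path with the scalar columns directly under it, so a path and its columns could feed a call such as `json_df.select(*[f"{path}.{c}" for c in scalars])`, with the result then written out as a table. Array columns would probably need `explode` first, and field names that themselves contain dots would need backtick quoting; neither case is handled in this sketch.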