PySpark: accessing and exploding nested items in a JSON

Asked: 2019-02-21 16:16:34

Tags: python json pyspark

I'm new to Spark and I'm trying to parse a JSON file containing data I want to aggregate, but I can't navigate its contents. I've looked at other solutions, but none of them fit my case.

Here is the schema of the DataFrame built from the imported JSON (a sketch of the read call follows the schema):

root
  |-- UrbanDataset: struct (nullable = true)
  |    |-- context: struct (nullable = true)
  |    |    |-- coordinates: struct (nullable = true)
  |    |    |    |-- format: string (nullable = true)
  |    |    |    |-- height: long (nullable = true)
  |    |    |    |-- latitude: double (nullable = true)
  |    |    |    |-- longitude: double (nullable = true)
  |    |    |-- language: string (nullable = true)
  |    |    |-- producer: struct (nullable = true)
  |    |    |    |-- id: string (nullable = true)
  |    |    |    |-- schemeID: string (nullable = true)
  |    |    |-- timeZone: string (nullable = true)
  |    |    |-- timestamp: string (nullable = true)
  |    |-- specification: struct (nullable = true)
  |    |    |-- id: struct (nullable = true)
  |    |    |    |-- schemeID: string (nullable = true)
  |    |    |    |-- value: string (nullable = true)
  |    |    |-- name: string (nullable = true)
  |    |    |-- properties: struct (nullable = true)
  |    |    |    |-- propertyDefinition: array (nullable = true)
  |    |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |    |-- codeList: string (nullable = true)
  |    |    |    |    |    |-- dataType: string (nullable = true)
  |    |    |    |    |    |-- propertyDescription: string (nullable = true)
  |    |    |    |    |    |-- propertyName: string (nullable = true)
  |    |    |    |    |    |-- subProperties: struct (nullable = true)
  |    |    |    |    |    |    |-- propertyName: array (nullable = true)
  |    |    |    |    |    |    |    |-- element: string (containsNull = true)
  |    |    |    |    |    |-- unitOfMeasure: string (nullable = true)
  |    |    |-- uri: string (nullable = true)
  |    |    |-- version: string (nullable = true)
  |    |-- values: struct (nullable = true)
  |    |    |-- line: array (nullable = true)
  |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |-- coordinates: struct (nullable = true)
  |    |    |    |    |    |-- format: string (nullable = true)
  |    |    |    |    |    |-- height: double (nullable = true)
  |    |    |    |    |    |-- latitude: double (nullable = true)
  |    |    |    |    |    |-- longitude: double (nullable = true)
  |    |    |    |    |-- id: long (nullable = true)
  |    |    |    |    |-- period: struct (nullable = true)
  |    |    |    |    |    |-- end_ts: string (nullable = true)
  |    |    |    |    |    |-- start_ts: string (nullable = true)
  |    |    |    |    |-- property: array (nullable = true)
  |    |    |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |    |    |-- name: string (nullable = true)
  |    |    |    |    |    |    |-- val: string (nullable = true)
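For reference, a schema like this typically comes from reading the file with the multiLine option enabled, since the document is a single nested JSON object spread over many lines. A minimal sketch (the file path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("urban-dataset").getOrCreate()
# multiLine=True is needed when the JSON is one object spanning multiple lines
df = spark.read.json("urban_dataset.json", multiLine=True)
df.printSchema()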

A subset of the full JSON is attached here.

My goal is to retrieve the values struct from this schema and manipulate/aggregate all of the val fields located at line.element.property.element.val.
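Note that element in the printed schema is just how Spark displays array contents, not a field you reference in column paths; the struct itself can be selected with plain dot notation. A minimal sketch:

df.select("UrbanDataset.values").printSchema()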

I also tried exploding it to get every field in "CSV-style" columns, but I got this error:

pyspark.sql.utils.AnalysisException: u"cannot resolve 'array(UrbanDataset.context, UrbanDataset.specification, UrbanDataset.values)' due to data type mismatch: input to function array should all be the same type"

Thanks

1 Answer:

Answer 0 (score: 1)

You cannot access a nested array directly; you need to use explode first. It will create one row for each element in the array.

from pyspark.sql import functions as F
# with this schema, the array to explode is UrbanDataset.values.line;
# each array element becomes its own row
df = df.withColumn("line", F.explode("UrbanDataset.values.line"))
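From there, reaching the individual val fields takes a second explode for the nested property array. A sketch of the full chain against the question's schema (the column aliases and the avg aggregation are illustrative assumptions):

from pyspark.sql import functions as F

# one row per element of the line array
lines = df.select(F.explode("UrbanDataset.values.line").alias("line"))

# one row per (line, property) pair
props = lines.select(
    F.col("line.id").alias("line_id"),
    F.explode("line.property").alias("prop"),
)

# val is stored as a string in the schema, so cast it before aggregating
vals = props.select(
    "line_id",
    F.col("prop.name").alias("name"),
    F.col("prop.val").cast("double").alias("val"),
)

vals.groupBy("name").agg(F.avg("val").alias("avg_val")).show()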