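I'm new to Spark and I'm trying to parse a JSON file containing data that I need to aggregate, but I can't navigate its content. I looked for other solutions but couldn't find anything that fits my case.
This is the schema of the DataFrame imported from the JSON: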
root
|-- UrbanDataset: struct (nullable = true)
| |-- context: struct (nullable = true)
| | |-- coordinates: struct (nullable = true)
| | | |-- format: string (nullable = true)
| | | |-- height: long (nullable = true)
| | | |-- latitude: double (nullable = true)
| | | |-- longitude: double (nullable = true)
| | |-- language: string (nullable = true)
| | |-- producer: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- schemeID: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| | |-- timestamp: string (nullable = true)
| |-- specification: struct (nullable = true)
| | |-- id: struct (nullable = true)
| | | |-- schemeID: string (nullable = true)
| | | |-- value: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- properties: struct (nullable = true)
| | | |-- propertyDefinition: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- codeList: string (nullable = true)
| | | | | |-- dataType: string (nullable = true)
| | | | | |-- propertyDescription: string (nullable = true)
| | | | | |-- propertyName: string (nullable = true)
| | | | | |-- subProperties: struct (nullable = true)
| | | | | | |-- propertyName: array (nullable = true)
| | | | | | | |-- element: string (containsNull = true)
| | | | | |-- unitOfMeasure: string (nullable = true)
| | |-- uri: string (nullable = true)
| | |-- version: string (nullable = true)
| |-- values: struct (nullable = true)
| | |-- line: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- coordinates: struct (nullable = true)
| | | | | |-- format: string (nullable = true)
| | | | | |-- height: double (nullable = true)
| | | | | |-- latitude: double (nullable = true)
| | | | | |-- longitude: double (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- period: struct (nullable = true)
| | | | | |-- end_ts: string (nullable = true)
| | | | | |-- start_ts: string (nullable = true)
| | | | |-- property: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- name: string (nullable = true)
| | | | | | |-- val: string (nullable = true)
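A subset of the full JSON is attached here.
My goal is to retrieve the values struct from this schema and manipulate/aggregate all the val fields located at line.element.property.element.val.
I also tried to explode it to get each field as a "CSV-style" column, but I got this error: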
pyspark.sql.utils.AnalysisException: u"cannot resolve 'array(`UrbanDataset`.`context`, `UrbanDataset`.`specification`, `UrbanDataset`.`values`)' due to data type mismatch: input to function array should all be the same type
Thanks
Answer 0 (score: 1)
You can't access the fields of a nested array directly; you need to use explode first.
It creates one row for each element in the array.
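from pyspark.sql import functions as F
# explode needs the array itself: in this schema that is UrbanDataset.values.line (values is a struct)
df = df.withColumn("Value", F.explode("UrbanDataset.values.line"))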