I have a basic question about how to read Hive struct types into a Spark DataFrame. For example, I have a Hive table like this:
user_id (string)
current_address (struct<city:string,state:string>)
previous_address (array<struct<city:string,state:string>>)
+--------------+------------------------------+-----------------------------------------------------------------+
| user_id | current_address | previous_address |
+--------------+------------------------------+-----------------------------------------------------------------+
| 1 |{"city":"Tampa","state":"FL"} | [{"city":"Newark","state":"NJ"},{"city":"Denver","state":"CO"}] |
+--------------+------------------------------+-----------------------------------------------------------------+
| 2 |{"city":"NY","state":"NY"} | [{"city":"Austin","state":"TX"}] |
+--------------+------------------------------+-----------------------------------------------------------------+
Spark SQL reads it as a DataFrame like this:
root
|-- user_id: string (nullable = true)
|-- current_address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- previous_address: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
+--------------+-----------------+--------------------------+
| user_id | current_address | previous_address |
+--------------+-----------------+--------------------------+
| 1 |[Tampa,FL] |[[Newark,NJ],[Denver,CO]] |
+--------------+-----------------+--------------------------+
| 2 |[NY,NY] | [[Austin,TX]] |
+--------------+-----------------+--------------------------+
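It looks like the Hive struct type is being read as an array. My plan is to later convert the DataFrame to a map and use the keys and their values for further operations.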
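How can I make Spark read these struct fields (i.e. current_address and previous_address) as key-value pairs like a Map, rather than as an array and an array of arrays, so that I end up with Map[String, String] and WrappedArray[Map[String, String]] instead of Array[String] and a nested array?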
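To make the goal concrete, here is a minimal sketch of one possible direction, not a confirmed solution. It assumes that a struct value comes back from a DataFrame as a Row that still carries its schema, so the bracketed display ([Tampa,FL]) would only be how the Row is printed. All names here (StructToMap, Address, User, sampleData, structToMap, convert) are hypothetical, and the sample DataFrame is built locally instead of being read from Hive.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Hypothetical names throughout; none of them come from the original table or job.
object StructToMap {
  case class Address(city: String, state: String)
  case class User(user_id: String, current_address: Address, previous_address: Seq[Address])

  // Builds a small local DataFrame shaped like the Hive table in the question.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      User("1", Address("Tampa", "FL"), Seq(Address("Newark", "NJ"), Address("Denver", "CO"))),
      User("2", Address("NY", "NY"), Seq(Address("Austin", "TX")))
    ).toDF()
  }

  // A struct value is a Row with a schema, so field names can be paired with values.
  def structToMap(r: Row): Map[String, String] =
    r.getValuesMap[String](r.schema.fieldNames.toSeq)

  // Collects to the driver, so this is only sensible for small results.
  // Returns user_id -> (current address map, list of previous address maps).
  def convert(df: DataFrame): Map[String, (Map[String, String], List[Map[String, String]])] =
    df.collect().map { row =>
      val current = structToMap(row.getAs[Row]("current_address"))
      val previous = row
        .getAs[scala.collection.Seq[Row]]("previous_address")
        .map(structToMap)
        .toList
      row.getAs[String]("user_id") -> ((current, previous))
    }.toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("struct-to-map").getOrCreate()
    convert(sampleData(spark)).foreach(println)
    spark.stop()
  }
}
```

With a real Hive table, the DataFrame passed to convert would come from something like spark.table(...) instead of sampleData. Because the sketch collects everything to the driver, a large table would need a different approach.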