Exploding an array containing nested arrays in plain Spark SQL

Date: 2020-10-30 08:39:37

Tags: json hive apache-spark-sql

I have the following JSON structure, stored as a string in a column of my Hive table:

[
  {"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
  {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
  {"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}
]

Now I want to parse each element of the outer array into its own table row, with each field of the JSON object becoming a column, while still keeping the inner list as a string:


| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |

Now, I know the trick of using a regex to insert a custom delimiter, splitting on it, and then exploding with a lateral view, but in this case the nested arrays would also match the regex: Parse json arrays using HIVE

Any ideas on how to do this? If possible, I would like to do it in plain Spark SQL, with no UDFs or SerDes.

What I have tried:

  1. select explode(get_json_object(outer_list, "$[*]")) from wt_test;

This doesn't work: it says the input to function explode should be array or map type, not string.

  2. select explode(split(substr(outer_list, 2, length(outer_list)-2),",")) from wt_test;

This splits on every single comma, putting each fragment on its own row, which is not what we want:

{"outer_id": 123000
"outer_field_1": blah
"inner_list": [{"inner_id": 456}
{"inner_id": 789}]}
... more rows ...

1 answer:

Answer 0 (score: 0)

Assuming I understood you correctly, you have the following:

Input:

{
   "some_id":1,
   "outer_list":'[{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}]'
}
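
For reference, a minimal way to reproduce this input as a one-row DataFrame (assuming an active SparkSession named spark; the column names some_id and outer_list are taken from the snippet above):

outer_json = (
    '[{"outer_id": 123000, "outer_field_1": "blah", '
    '"inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, '
    '{"outer_id": 123001, "outer_field_1": "blahblah", '
    '"inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, '
    '{"outer_id": 123002, "outer_field_1": "blahblahblah", '
    '"inner_list": [{"inner_id": 456}, {"inner_id": 789}]}]'
)
# one row: (some_id, outer_list-as-string)
df = spark.createDataFrame([(1, outer_json)], ['some_id', 'outer_list'])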

Desired output:

| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |

First you need to parse the string, using a schema that you define:

schema = ArrayType(
   StructType([StructField('outer_id', IntegerType()), 
               StructField('outer_field_1', StringType()), 
               StructField('inner_list', StringType())])
)

Note that this is the simple version, in which inner_list is kept as a plain string.
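
On recent Spark versions, from_json also accepts the schema as a DDL-formatted string, so an equivalent, more compact definition is possible; it is the same string the SQL version at the end uses:

# DDL-string form of the schema above; inner_list still stays a plain string
schema = "array<struct<outer_id:int,outer_field_1:string,inner_list:string>>"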

Apply that schema to your dataframe:

df = df.select(from_json('outer_list', schema).alias('test'))

Now you have a column that contains an array:

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|test                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[123000, blah, [{"inner_id":456},{"inner_id":789}]], [123001, blahblah, [{"inner_id":456},{"inner_id":789}]], [123002, blahblahblah, [{"inner_id":456},{"inner_id":789}]]]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Now you can explode it:

df.select(explode('test').alias('exploded')).select('exploded.*')

That does it:

+--------+-------------+-----------------------------------+
|outer_id|outer_field_1|inner_list                         |
+--------+-------------+-----------------------------------+
|123000  |blah         |[{"inner_id":456},{"inner_id":789}]|
|123001  |blahblah     |[{"inner_id":456},{"inner_id":789}]|
|123002  |blahblahblah |[{"inner_id":456},{"inner_id":789}]|
+--------+-------------+-----------------------------------+
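
For convenience, the two steps also chain into a single expression. A minimal sketch, reusing the original df and the schema defined above:

# parse the string column, explode the resulting array, flatten the structs
result = (df
    .select(explode(from_json('outer_list', schema)).alias('exploded'))
    .select('exploded.*'))
result.show(truncate=False)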

Now, you could do the same with inner_list right from the start, while parsing outer_list; a sketch of that follows below. But you should try the simple version first, it has everything you need.
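
A sketch of that fully nested variant, which types inner_list as an array of structs up front and flattens it with a second explode (untested, so treat it as a starting point rather than a drop-in solution):

from pyspark.sql.functions import explode, from_json
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

# inner_list is now a real array of structs instead of a string
nested_schema = ArrayType(StructType([
    StructField('outer_id', IntegerType()),
    StructField('outer_field_1', StringType()),
    StructField('inner_list', ArrayType(StructType([
        StructField('inner_id', IntegerType())
    ])))
]))

nested = (df
    .select(explode(from_json('outer_list', nested_schema)).alias('outer'))
    # second explode: one output row per inner_id
    .select('outer.outer_id', 'outer.outer_field_1',
            explode('outer.inner_list').alias('inner'))
    .select('outer_id', 'outer_field_1', 'inner.inner_id'))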

Don't forget the imports:

from pyspark.sql.functions import *
from pyspark.sql.types import *

SQL version (if the input is given as a table json_test):

select exploded.* from (
  select explode(
    from_json(
      outer_list,
      "array<struct<outer_id:int,outer_field_1:string,inner_list:string>>"
    )
  ) as exploded
  from json_test
)
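
Spark SQL also supports the Hive-style LATERAL VIEW syntax, so the same query works without a subquery. A sketch via spark.sql, assuming the input is registered as the table json_test:

# same explode, written with LATERAL VIEW instead of a subquery
spark.sql("""
    select exploded.*
    from json_test
    lateral view explode(
        from_json(outer_list,
            "array<struct<outer_id:int,outer_field_1:string,inner_list:string>>")
    ) t as exploded
""").show(truncate=False)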