I have the following JSON structure, stored as a string in a column of my Hive table:
[
{"outer_id": 123000, "outer_field_1": blah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
{"outer_id": 123001, "outer_field_1": blahblah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
{"outer_id": 123002, "outer_field_1": blahblahblah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
]
Now I want to parse each element of the outer array into its own row of a table, with each field of the JSON object becoming a column, while keeping the inner list as a string:
| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |
Now, I know the trick of using a regex to create a custom delimiter, splitting on it, and then using LATERAL VIEW explode; but in this case there are nested arrays that would also match the regex: Parse json arrays using HIVE
Any ideas on how to do this? If possible, I would like to do it in raw Spark SQL, with no UDFs or SerDes.
Things I have tried:
select explode(get_json_object(outer_list, "$[*]")) from wt_test;
This does not work: it says the input to function explode should be array or map type, not string.
select explode(split(substr(outer_list, 2, length(outer_list)-2), ",")) from wt_test;
This splits on every comma, producing one row per fragment, which is not what we want:
{"outer_id": 123000
"outer_field_1": blah
"inner_list": [{"inner_id": 456}
{"inner_id": 789}]}
... more rows ...
Answer 0 (score: 0)
Assuming I understood you correctly, you have something like this:
{
"some_id":1,
"outer_list":'[{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}]'
}
and you want to turn it into this table:
| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |
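A minimal sketch to reproduce this setup (the single-row dataframe and the session bootstrap are my assumptions; the column names come from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row holding the whole JSON array as a string, mirroring the structure above
outer_list = (
    '[{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},'
    ' {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},'
    ' {"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}]'
)
df = spark.createDataFrame([(1, outer_list)], ['some_id', 'outer_list'])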
First, you need to parse the string; for that, define a schema:
schema = ArrayType(
    StructType([
        StructField('outer_id', IntegerType()),
        StructField('outer_field_1', StringType()),
        StructField('inner_list', StringType()),
    ])
)
Note that this is the simple version, where inner_list is just kept as a string.
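For comparison, here is a sketch of a fuller schema that parses inner_list as well; the assumption (based on your sample data) is that every inner object carries only inner_id:

# Assumed shape: inner objects contain only the inner_id field
full_schema = ArrayType(
    StructType([
        StructField('outer_id', IntegerType()),
        StructField('outer_field_1', StringType()),
        StructField('inner_list', ArrayType(
            StructType([StructField('inner_id', IntegerType())]))),
    ])
)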
Apply that schema to your dataframe:
df = df.select(from_json('outer_list', schema).alias('test'))
Now you have a single column containing an array:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|test |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[123000, blah, [{"inner_id":456},{"inner_id":789}]], [123001, blahblah, [{"inner_id":456},{"inner_id":789}]], [123002, blahblahblah, [{"inner_id":456},{"inner_id":789}]]]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
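If you want to double-check the parse before exploding, printSchema should report an array of structs (illustrative output, limited to the fields we defined):

df.printSchema()
# root
#  |-- test: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- outer_id: integer (nullable = true)
#  |    |    |-- outer_field_1: string (nullable = true)
#  |    |    |-- inner_list: string (nullable = true)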
Now you can explode it:
df.select(explode('test').alias('exploded')).select('exploded.*')
And that does it:
+--------+-------------+-----------------------------------+
|outer_id|outer_field_1|inner_list |
+--------+-------------+-----------------------------------+
|123000 |blah |[{"inner_id":456},{"inner_id":789}]|
|123001 |blahblah |[{"inner_id":456},{"inner_id":789}]|
|123002 |blahblahblah |[{"inner_id":456},{"inner_id":789}]|
+--------+-------------+-----------------------------------+
Now, if you want, you can do the same for inner_list right away while parsing outer_list. But try this version first; it has everything you need.
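If you would rather take inner_list apart in a second pass, here is a sketch building on the dataframe above (the inner schema string and the inner_obj alias are my assumptions):

# Assumed schema for the inner objects
inner_schema = "array<struct<inner_id:int>>"
result = (df.select(explode('test').alias('exploded'))
            .select('exploded.*')
            .withColumn('inner_obj', explode(from_json('inner_list', inner_schema)))
            .select('outer_id', 'outer_field_1', 'inner_obj.inner_id'))
result.show()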
Don't forget the imports:
from pyspark.sql.functions import explode, from_json
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
And since you asked for raw Spark SQL, here is the same thing without the DataFrame API:

select exploded.*
from (
  select explode(
    from_json(
      outer_list,
      "array<struct<outer_id:int,outer_field_1:string,inner_list:string>>"
    )
  ) as exploded
  from json_test
) t
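And if you later want to unpack inner_list in pure SQL as well, a sketch along the same lines (the array<struct<inner_id:int>> schema and the aliases are assumptions):

-- LATERAL VIEW explode flattens the parsed inner array into one row per inner object
select t.exploded.outer_id, t.exploded.outer_field_1, inner_obj.inner_id
from (
  select explode(
    from_json(
      outer_list,
      "array<struct<outer_id:int,outer_field_1:string,inner_list:string>>"
    )
  ) as exploded
  from json_test
) t
lateral view explode(from_json(t.exploded.inner_list, "array<struct<inner_id:int>>")) iv as inner_obj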