PySpark - explode nested JSON with struct and array of structs

Date: 2020-04-01 15:09:48

Tags: python pyspark

I am trying to parse nested JSON from some sample data. Below is the printed schema:

 |-- batters: struct (nullable = true)
 |    |-- batter: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- ppu: double (nullable = true)
 |-- topping: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- type: string (nullable = true)
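
For reference, a minimal one-record sample that produces a schema like this (the values are made up for illustration; my real data has the same structure) can be read as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one JSON object per line (JSON Lines); the record below is a made-up example
sample = ['{"id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, '
          '"batters": {"batter": [{"id": "1001", "type": "Regular"}, {"id": "1002", "type": "Chocolate"}]}, '
          '"topping": [{"id": "5001", "type": "None"}, {"id": "5002", "type": "Glazed"}]}']
df_json = spark.read.json(spark.sparkContext.parallelize(sample))
df_json.printSchema()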

I tried to explode batters and topping separately and then combine them.

from pyspark.sql.functions import explode

df_batter = df_json.select("batters.*")
df_explode1 = df_batter.withColumn("batter", explode("batter")).select("batter.*")

df_explode2 = df_json.withColumn("topping", explode("topping"))\
                     .select("id", "type", "name", "ppu", "topping.*")

I am unable to merge the two dataframes.

I also tried a single query:

exploded1 = df_json.withColumn("batter", df_batter.withColumn("batter", explode("batter")))\
                   .withColumn("topping", explode("topping"))\
                   .select("id", "type", "name", "ppu", "topping.*", "batter.*")

But I am getting an error. Please help me fix this. Thanks.

1 answer:

Answer 0 (score: 1)

You basically have to explode both arrays together; use arrays_zip, which returns a merged array of structs. Try this. I haven't tested it, but it should work.

from pyspark.sql import functions as F

# arrays_zip pairs batter[i] with topping[i]; exploding it yields one row per pair
df_json.select("id", "type", "name", "ppu", "topping", "batters.*")\
       .withColumn("zipped", F.explode(F.arrays_zip("batter", "topping")))\
       .select("id", "type", "name", "ppu", "zipped.*").show()

You can also do it one explode at a time:

from pyspark.sql import functions as F

df1 = df_json.select("id", "type", "name", "ppu", "topping", "batters.*")\
             .withColumn("batter", F.explode("batter"))\
             .select("id", "type", "name", "ppu", "topping", "batter")
df1.withColumn("topping", F.explode("topping"))\
   .select("id", "type", "name", "ppu", "topping.*", "batter.*").show()
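
Note that the two approaches are not equivalent: exploding the arrays one after another gives every batter/topping combination per record (a cross product), whereas arrays_zip pairs the elements by position (batter[0] with topping[0], and so on, padding with null when the arrays have different lengths). Which one you want depends on how the data is supposed to line up.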