PySpark: exploding a JSON array that has zero or more elements

Time: 2018-06-05 20:19:12

Tags: apache-spark pyspark

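I have some JSON data in which an array can contain zero or more elements. The data is shown below. When I explode the array, rows whose array has zero elements are dropped. In this case, the row with name Andy is dropped.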

>>> from pyspark.sql import functions as func
>>> d1 = [{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]},{"name":"Andy","schools":[]}]
>>> df1 = sqlContext.createDataFrame(d1)
>>> df2 = df1.withColumn('school_details', func.explode(df1.schools))
>>> df3 = df2.select(df2.name, df2.school_details.sname, df2.school_details.year)
>>> df3.show()
+-------+---------------------+--------------------+
|   name|school_details[sname]|school_details[year]|
+-------+---------------------+--------------------+
|Michael|             stanford|                2010|
|Michael|             berkeley|                2012|
+-------+---------------------+--------------------+

How can I get all of the records, as shown below?

Expected result

+-------+---------------------+--------------------+
|   name|school_details[sname]|school_details[year]|
+-------+---------------------+--------------------+
|Michael|             stanford|                2010|
|Michael|             berkeley|                2012|
|   Andy|                 null|                null|
+-------+---------------------+--------------------+
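
The behaviour being asked for is an "outer" explode: a row with an empty array still produces one output row, with null in place of the element. As far as I know, PySpark 2.3 and later provide `pyspark.sql.functions.explode_outer` for this, so swapping it in for `func.explode` may be enough, though that has not been verified against this data. The plain-Python sketch below only illustrates the semantics; `explode_rows` and the `schools_item` key are hypothetical names made up for the example and are not part of Spark.

```python
def explode_rows(records, column, outer=False):
    """Return one output row per element of records[i][column].

    With outer=False, a record whose array is empty produces no rows.
    With outer=True, such a record still produces a single row whose
    element is None (the "outer" explode behaviour).
    """
    rows = []
    for record in records:
        elements = record.get(column) or []
        if not elements and outer:
            elements = [None]  # keep the record, with a null element
        for element in elements:
            # Copy every field except the array, then attach the element.
            row = {k: v for k, v in record.items() if k != column}
            row[column + "_item"] = element
            rows.append(row)
    return rows


d1 = [
    {"name": "Michael",
     "schools": [{"sname": "stanford", "year": 2010},
                 {"sname": "berkeley", "year": 2012}]},
    {"name": "Andy", "schools": []},
]

inner = explode_rows(d1, "schools")              # Andy is dropped
outer = explode_rows(d1, "schools", outer=True)  # Andy is kept with None

# Assumed PySpark 2.3+ equivalent of the outer case (not run here):
#   df2 = df1.withColumn('school_details', func.explode_outer(df1.schools))
```

Here `inner` contains only the two Michael rows, while `outer` also contains an Andy row whose `schools_item` is `None`, which matches the shape of the expected result above.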

0 answers:

No answers