How to aggregate a PySpark explode back into new columns

Time: 2020-02-24 15:29:47

Tags: pyspark aggregate-functions pyspark-dataframes

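I have a Spark DataFrame with a column that holds an array of type:value fields. I can explode it so that each type:value pair becomes its own row, with the type and the value in separate columns. Now I want to aggregate back so that I get a single row per entity_id with a series of columns, where each column name is a type and each column value is the corresponding value.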

df.show(5)

+----------------+-----------------------------------------------------------------------------------------------------------------------------+
|entity_id       |_tags                                                                                                                        |
+----------------+-----------------------------------------------------------------------------------------------------------------------------+
|5bdb7c3...8a17f9|[Row(type='cond1', value='a=1'),Row(type='cond2', value='a=2'),Row(type='cond3', value='a=3'),Row(type='cond4', value='a=4')]|
+----------------+-----------------------------------------------------------------------------------------------------------------------------+

After exploding (tags_exploded=df.select(f.col("entity_id"),f.explode(f.col("_tags")))), I get:

tags_exploded.show(5,False)

+----------------+-----+-----+
|entity_id       |type |value|
+----------------+-----+-----+
|5bdb7c3...8a17f9|cond1|a=1  |
|5bdb7c3...8a17f9|cond2|a=2  |
|5bdb7c3...8a17f9|cond3|a=3  |
|5bdb7c3...8a17f9|cond4|a=4  |
+----------------+-----+-----+

The result I want is:

+----------------+-----+-----+-----+-----+
|entity_id       |cond1|cond2|cond3|cond4|
+----------------+-----+-----+-----+-----+
|5bdb7c3...8a17f9|a=1  |a=2  |a=3  |a=4  |
+----------------+-----+-----+-----+-----+

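How can I aggregate the exploded rows to get the desired result, or alternatively extract the fields from the original array to get the same result? To start with, I am considering the case where all of the final columns are present in every row of the original df (i.e. every entity has cond1, cond2, cond3, and cond4).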

0 answers:

No answers