Question

Suppose I have a DataFrame:

product_id  customer
1 1
1 2
1 4
2 1
2 2

I want to group the DataFrame above into:

product_id customers
1 [1,2,4]
2 [1,2]

How can I do this with PySpark?

Answer 1

Hope this helps!

import pyspark.sql.functions as f

# Group rows by product_id and collect each group's customer values into a list column
df.groupby("product_id").agg(f.collect_list("customer").alias("customers")).show()

(Editor's note: added the import statement to the code)
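
To make explicit what the aggregation computes, here is a minimal plain-Python sketch of the same grouping, with no Spark session required. It only illustrates the semantics and is not how Spark executes the query; the helper name `group_collect` is hypothetical, and the rows are the sample data from the question.

```python
from collections import defaultdict


def group_collect(rows):
    """Group (product_id, customer) pairs and collect the customers per product.

    Hypothetical helper that mirrors groupby("product_id") followed by
    collect_list("customer"), for illustration only.
    """
    grouped = defaultdict(list)
    for product_id, customer in rows:
        # Append each customer to the list for its product, keeping duplicates
        grouped[product_id].append(customer)
    return dict(grouped)


# Sample rows from the question: (product_id, customer)
rows = [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2)]

customers_by_product = group_collect(rows)
print(customers_by_product)  # {1: [1, 2, 4], 2: [1, 2]}
```

One difference to keep in mind: this sketch preserves input order, whereas in Spark the order of elements produced by collect_list is not guaranteed after a shuffle. If order matters, wrapping the call as f.sort_array(f.collect_list("customer")) should give a sorted list, and f.collect_set("customer") can be used instead when duplicates should be dropped.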