I have a set of user activity records from a shopping platform in pyspark:
user_id | product_id | Event (viewed product, purchased, added to cart, etc.)
The thing is that the same (user_id, product_id) tuple can have several event types, and I want to collect all of those events on a single row.
Example:
╔════════════════════════════════════╗
║ user_id | product_id | Event       ║
╠════════════════════════════════════╣
║    1    |     1      | viewed      ║
║    1    |     1      | purchased   ║
║    2    |     1      | added       ║
║    2    |     2      | viewed      ║
║    2    |     2      | added       ║
╚════════════════════════════════════╝
I want:
╔════════════════════════════════════════════╗
║ user_id | product_id | Event               ║
╠════════════════════════════════════════════╣
║    1    |     1      | {viewed, purchased} ║
║    2    |     1      | {added}             ║
║    2    |     2      | {viewed, added}     ║
╚════════════════════════════════════════════╝
Answer 0 (score: 0)
// Scala RDD API: key by (user_id, product_id), then group the events per key
val grouped: RDD[((Int, Int), Iterable[String])] = rdd.map(triplet => ((triplet._1, triplet._2), triplet._3)).groupByKey()
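Since the question is about pyspark, a rough Python equivalent of the same idea (a sketch, assuming sc is an existing SparkContext and rdd holds (user_id, product_id, event) triples) could look like:
# Key each record by (user_id, product_id), then group the events per key
pairs = rdd.map(lambda t: ((t[0], t[1]), t[2]))
grouped = pairs.groupByKey().mapValues(list)
grouped.collect()
# e.g. [((1, 1), ['viewed', 'purchased']), ((2, 1), ['added']), ((2, 2), ['viewed', 'added'])]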
Answer 1 (score: 0)
If you want to try it with a DataFrame, take a look at the following:
import pyspark.sql.functions as F

# Build a DataFrame from the sample (user_id, product_id, Event) records
rdd = sc.parallelize([[1, 1, 'viewed'], [1, 1, 'purchased'], [2, 1, 'added'], [2, 2, 'viewed'], [2, 2, 'added']])
df = rdd.toDF(['user_id', 'product_id', 'Event'])

# Group by (user_id, product_id) and collect the distinct events into a set
df.groupby(['user_id', 'product_id']).agg(F.collect_set("Event")).show()
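If you want the collected column to carry a friendlier name, collect_set can be aliased; the name Events below is just an illustrative choice:
df.groupby(['user_id', 'product_id']).agg(F.collect_set("Event").alias("Events")).show(truncate=False)
Note that collect_set drops duplicate events within a group; use F.collect_list instead if repeated events should be kept.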
If you prefer to stay with the RDD API, take a look at this:
rdd = sc.parallelize([[1, 1, 'viewed'], [1, 1, 'purchased'], [2, 1, 'added'], [2, 2, 'viewed'], [2, 2, 'added']])
# Group by (user_id, product_id), then pull the event names out of each grouped record
rdd.groupBy(lambda x: (x[0], x[1])).map(lambda x: (x[0][0], x[0][1], [e[2] for e in x[1]])).collect()
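As a variation on the answer above (not what it proposes), a reduceByKey sketch builds the event sets map-side before the shuffle, which tends to move less data for large inputs:
# Combine events into per-key sets before shuffling, then flatten back to triples
rdd.map(lambda x: ((x[0], x[1]), {x[2]})) \
   .reduceByKey(lambda a, b: a | b) \
   .map(lambda kv: (kv[0][0], kv[0][1], kv[1])) \
   .collect()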