Finding unique tuples in a PySpark RDD

Date: 2017-03-06 23:13:45

Tags: python apache-spark mapreduce pyspark

I have a set of user activity data from a shopping platform in PySpark:

user_id | product_id | event (product viewed, purchased, added to cart, etc.)

The thing is, the same (user_id, product_id) pair can have several event types, and I want to collect all of those events into a single row.

Example:

╔═════════════════════════════════════════════════╗
║ user_id    |  product_id             |   Event  ║
╠═════════════════════════════════════════════════╣
║ 1               1                     viewed    ║
║ 1               1                     purchased ║
║ 2               1                     added     ║
║ 2               2                     viewed    ║
║ 2               2                     added     ║
╚═════════════════════════════════════════════════╝

What I want:

╔════════════════════════════════════════════════╗
║ user_id | product_id |      Event              ║
╠════════════════════════════════════════════════╣
║ 1          1          {viewed, purchased}      ║
║ 2          1          {added}                  ║
║ 2          2          {viewed, added}          ║
╚════════════════════════════════════════════════╝

2 Answers:

Answer 0 (score: 0)

In Scala it would look something like this:

// Types assumed from the example data: Int ids, String events.
val grouped: RDD[((Int, Int), Iterable[String])] =
  rdd.map(triplet => ((triplet._1, triplet._2), triplet._3)).groupByKey()
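
For reference, a minimal PySpark sketch of the same groupByKey idea (the sample data and column positions are assumed from the question, not taken from this answer):

rdd = sc.parallelize([(1, 1, 'viewed'), (1, 1, 'purchased'), (2, 1, 'added'), (2, 2, 'viewed'), (2, 2, 'added')])
# Key by (user_id, product_id), then gather each pair's events
grouped = rdd.map(lambda t: ((t[0], t[1]), t[2])).groupByKey()
grouped.mapValues(set).collect()
# e.g. [((1, 1), {'viewed', 'purchased'}), ((2, 1), {'added'}), ((2, 2), {'viewed', 'added'})] -- ordering may vary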

Answer 1 (score: 0)

If you want to try it with a DataFrame, have a look at the following:

import pyspark.sql.functions as F
rdd = sc.parallelize([[1, 1, 'viewed'], [1, 1, 'purchased'], [2, 1, 'added'], [2, 2, 'viewed'], [2, 2, 'added']])
df = rdd.toDF(['user_id', 'product_id', 'Event'])
# collect_set gathers the distinct events for each (user_id, product_id) pair
df.groupby(['user_id', 'product_id']).agg(F.collect_set("Event")).show()
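
A small follow-up sketch (the .alias() call is my addition, not part of the original answer): collect_set already de-duplicates the events, and aliasing gives the aggregated column a cleaner name:

df.groupby(['user_id', 'product_id']).agg(F.collect_set("Event").alias("Event")).show()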

If you'd rather stick with an RDD, take a look at this:

rdd = sc.parallelize([[1, 1, 'viewed'], [1, 1, 'purchased'], [2, 1, 'added'], [2, 2, 'viewed'], [2, 2, 'added']])
# Group by (user_id, product_id), then collect each group's events into a set
rdd.groupBy(lambda x: (x[0], x[1])).map(lambda g: (g[0][0], g[0][1], {e[2] for e in g[1]})).collect()
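
As an aside, a hedged alternative sketch (my addition, assuming the same triple layout): reduceByKey builds the per-key sets incrementally instead of materializing each full group, which can be cheaper when a key has many events:

# Start each record as a one-element set, then union the sets per key
pairs = rdd.map(lambda x: ((x[0], x[1]), {x[2]}))
pairs.reduceByKey(lambda a, b: a | b).map(lambda kv: (kv[0][0], kv[0][1], kv[1])).collect()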