I am writing movie recommendation code in PySpark. The ALS recommendation output is one array in the movie_id column and another array in the rating column. However, when I explode the columns into separate temporary dataframes and then join them on 'user_id', the inner join produces a Cartesian product: since user_id is identical on every exploded row, each movie row matches every rating row.
from pyspark.sql import functions as F
from pyspark.sql.functions import col

user_recs_one = user_recs.where(user_recs.user_id == 1)
user_recs_one.show(truncate=False)
+-------+-------------------------------------------------------+
|user_id|recommendations |
+-------+-------------------------------------------------------+
|1 |[[1085, 6.1223927], [1203, 6.0752907], [745, 5.954721]]|
+-------+-------------------------------------------------------+
user_recs_one
DataFrame[user_id: int, recommendations: array<struct<movie_id:int,rating:float>>]
user_recs_one = user_recs_one.select("user_id", "recommendations.movie_id", "recommendations.rating")
user_recs_one.show(truncate=False)
+-------+-----------------+--------------------------------+
|user_id|movie_id |rating |
+-------+-----------------+--------------------------------+
|1 |[1085, 1203, 745]|[6.1223927, 6.0752907, 5.954721]|
+-------+-----------------+--------------------------------+
user_recs_one
DataFrame[user_id: int, movie_id: array<int>, rating: array<float>]
x = user_recs_one.select("user_id", F.explode(col("movie_id")).alias("movie_id"))
x.show()
+-------+--------+
|user_id|movie_id|
+-------+--------+
| 1| 1085|
| 1| 1203|
| 1| 745|
+-------+--------+
y = user_recs_one.select("user_id", F.explode(col("rating")).alias("rating"))
y.show()
+-------+---------+
|user_id| rating|
+-------+---------+
| 1|6.1223927|
| 1|6.0752907|
| 1| 5.954721|
+-------+---------+
x.join(y, on='user_id', how='inner').show()
+-------+--------+---------+
|user_id|movie_id| rating|
+-------+--------+---------+
| 1| 1085|6.1223927|
| 1| 1085|6.0752907|
| 1| 1085| 5.954721|
| 1| 1203|6.1223927|
| 1| 1203|6.0752907|
| 1| 1203| 5.954721|
| 1| 745|6.1223927|
| 1| 745|6.0752907|
| 1| 745| 5.954721|
+-------+--------+---------+
Answer 0 (score: 0)
Since my result set is very small, this is what I ended up implementing:
user_recs_one = user_recs_one.select("user_id", "recommendations.movie_id", "recommendations.rating")
user_recs_one.show(truncate=False)
+-------+-----------------+--------------------------------+
|user_id|movie_id |rating |
+-------+-----------------+--------------------------------+
|1 |[1085, 1203, 745]|[6.1223927, 6.0752907, 5.954721]|
+-------+-----------------+--------------------------------+
user_recs_one
DataFrame[user_id: int, movie_id: array<int>, rating: array<float>]
Introducing a sequence id:
To join the recommended movies with the recommended ratings, we need to introduce an additional id column. To ensure that the values in the id column are increasing, we use the monotonically_increasing_id() function. This function is guaranteed to produce increasing numbers when the dataframe has more than one partition, but not consecutive ones, so we also repartition each exploded dataframe into a single partition. (Alternatives using posexplode and a direct struct explode are sketched after the final join below.)
only_movies = user_recs_one.select("user_id", F.explode(col("movie_id")).alias("movie_id"))
only_movies = only_movies.repartition(1).withColumn('id', F.monotonically_increasing_id())
only_movies = only_movies.select('id', 'user_id', 'movie_id')
only_movies.show()
+---+-------+--------+
| id|user_id|movie_id|
+---+-------+--------+
| 0| 1| 1085|
| 1| 1| 1203|
| 2| 1| 745|
+---+-------+--------+
only_ratings = user_recs_one.select("user_id", F.explode(col("rating")).alias("rating"))
only_ratings = only_ratings.repartition(1).withColumn('id', F.monotonically_increasing_id())
only_ratings = only_ratings.select('id', 'user_id', 'rating')
only_ratings.show()
+---+-------+---------+
| id|user_id| rating|
+---+-------+---------+
| 0| 1|6.1223927|
| 1| 1|6.0752907|
| 2| 1| 5.954721|
+---+-------+---------+
only_movies.join(only_ratings.drop('user_id'), on='id', how='inner').drop('id').show()
+-------+--------+---------+
|user_id|movie_id| rating|
+-------+--------+---------+
| 1| 1085|6.1223927|
| 1| 1203|6.0752907|
| 1| 745| 5.954721|
+-------+--------+---------+
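As an aside, the sequential id built above is exactly the array position, and posexplode() returns that position alongside each element, which makes the repartition step unnecessary. A minimal sketch, assuming the same user_recs_one dataframe with array columns as above:

movies = user_recs_one.select("user_id",
                              F.posexplode("movie_id").alias("id", "movie_id"))
ratings = user_recs_one.select("user_id",
                               F.posexplode("rating").alias("id", "rating"))
# Joining on the array position keeps each movie aligned with its rating.
movies.join(ratings.drop("user_id"), on="id", how="inner").drop("id").show()

And because recommendations is an array<struct<movie_id,rating>>, the join can be avoided entirely by exploding the struct array once and then selecting the struct fields. A sketch against the original user_recs dataframe:

# Each exploded element is a struct; its fields become ordinary columns.
flat = (user_recs
        .select("user_id", F.explode("recommendations").alias("rec"))
        .select("user_id", "rec.movie_id", "rec.rating"))
flat.show()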