I have a DataFrame of the following format:
item_id1: Long, item_id2: Long, similarity_score: Double
What I'm trying to do is to get top N highest similarity_score records for each item_id1. So, for example:
1 2 0.5
1 3 0.4
1 4 0.3
2 1 0.5
2 3 0.4
2 4 0.3
With top 2 similar items would give:
1 2 0.5
1 3 0.4
2 1 0.5
2 3 0.4
My rough idea is that it can be done by first grouping the records by item_id1, then sorting each group by score in descending order, and finally limiting each group to N results. But I'm stuck on how to implement this in Spark Scala.
Thank you.
Answer 0: (score: 1)
I would suggest using a window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

df
  .withColumn("rnk", row_number().over(Window.partitionBy($"item_id1").orderBy($"similarity_score".desc)))
  .where($"rnk" <= 2)
  .drop("rnk")
Alternatively, you can use dense_rank or rank instead of row_number, depending on how you want to handle ties in similarity_score.
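To see what the window function computes, here is a minimal sketch of the same "top N per key" logic using plain Scala collections on the sample data from the question. The case class `Sim` and the value `topN` are illustrative names, not from the original post; the real solution should still use the DataFrame window function, which scales beyond in-memory data.

```scala
// Sketch of "top N rows per group" on plain collections.
// Sim and topN are hypothetical names introduced for illustration.
case class Sim(itemId1: Long, itemId2: Long, score: Double)

val rows = Seq(
  Sim(1, 2, 0.5), Sim(1, 3, 0.4), Sim(1, 4, 0.3),
  Sim(2, 1, 0.5), Sim(2, 3, 0.4), Sim(2, 4, 0.3)
)

val topN = 2

// Group by the first item id, sort each group by score descending,
// and keep only the first topN rows of each group.
val result = rows
  .groupBy(_.itemId1)
  .toSeq
  .flatMap { case (_, group) => group.sortBy(-_.score).take(topN) }
  .sortBy(r => (r.itemId1, -r.score)) // stable output order for display

// result: Seq(Sim(1,2,0.5), Sim(1,3,0.4), Sim(2,1,0.5), Sim(2,3,0.4))
```

Note that `take(topN)` after a descending sort corresponds to `row_number() <= topN`: ties are broken arbitrarily, just as with `row_number`, whereas `rank`/`dense_rank` would keep all tied rows.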