Spark get top N highest score results for each (item1, item2, score)

Date: 2018-03-09 19:06:11

Tags: scala apache-spark spark-dataframe rdd

I have a DataFrame of the following format:

item_id1: Long, item_id2: Long, similarity_score: Double

What I'm trying to do is to get top N highest similarity_score records for each item_id1. So, for example:

1 2 0.5
1 3 0.4
1 4 0.3
2 1 0.5
2 3 0.4
2 4 0.3

Taking the top 2 similar items would give:

1 2 0.5
1 3 0.4
2 1 0.5
2 3 0.4

My vague guess is that it can be done by first grouping records by item_id1, then sorting each group in descending order by score, and finally limiting the results. But I'm stuck on how to implement this in Spark with Scala.

Thank you.

1 answer:

Answer 0: (score: 1)

I would suggest using a window function:

 import org.apache.spark.sql.expressions.Window
 import org.apache.spark.sql.functions.row_number

 df
  .withColumn("rnk", row_number().over(Window.partitionBy($"item_id1").orderBy($"similarity_score".desc)))
  .where($"rnk" <= 2)

Alternatively, you can use dense_rank / rank instead of row_number, depending on how you want to handle ties in similarity_score.
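Putting the window-function approach together, here is a minimal self-contained sketch using the sample data from the question. It assumes a local SparkSession (the object name `TopNPerItem` and the app name are illustrative, not from the original):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

object TopNPerItem {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("top-n-per-item")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question.
    val df = Seq(
      (1L, 2L, 0.5), (1L, 3L, 0.4), (1L, 4L, 0.3),
      (2L, 1L, 0.5), (2L, 3L, 0.4), (2L, 4L, 0.3)
    ).toDF("item_id1", "item_id2", "similarity_score")

    // Rank rows within each item_id1 partition, highest score first.
    val w = Window.partitionBy($"item_id1").orderBy($"similarity_score".desc)

    // Keep the two highest-scoring rows per item_id1, then drop the helper column.
    val top2 = df
      .withColumn("rnk", row_number().over(w))
      .where($"rnk" <= 2)
      .drop("rnk")

    top2.show()
    spark.stop()
  }
}
```

With the sample data, this keeps the rows (1, 2, 0.5), (1, 3, 0.4), (2, 1, 0.5), and (2, 3, 0.4). Note that row_number breaks ties arbitrarily; if tied rows should all be kept, rank or dense_rank is the better choice.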