Question

I have an RDD[(VertexId, Double)] that I want to sort by _._2 and then attach an index (rank) to each element, so that I can use filter to look up an element together with its rank.

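Currently I sort the RDD with sortBy, but I don't know how to attach the rank to the RDD. So I collect it as a sequence and zip it with its indices. But this is not efficient, because everything ends up on the driver. I am wondering whether there is a more elegant way to do this.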

The code I'm using now is:

val tmpRes = graph.vertices.sortBy(_._2, ascending = false) // sort all nodes by PR score in descending order
  .collect() // collect to the driver; this may be very expensive

tmpRes.zip(tmpRes.indices) // zip with index

Answer 1

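If you only want to bring the first n tuples back to the driver anyway, then you could use takeOrdered(n)([ordering]), where n is the number of results to bring back and ordering is the comparator you want to use.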
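A minimal sketch of the takeOrdered approach, assuming the scores are keyed by Long (VertexId is an alias for Long in GraphX) and that a higher score should get a better rank. The names TopN and topN are hypothetical, introduced only for illustration:

```scala
import org.apache.spark.rdd.RDD

object TopN {
  // Hypothetical helper: fetch the n highest-scoring pairs to the driver
  // and pair each one with its 0-based rank.
  def topN(scores: RDD[(Long, Double)], n: Int): Array[((Long, Double), Int)] = {
    // Order by the score (second field), reversed so the largest comes first.
    val byScoreDesc = Ordering.by[(Long, Double), Double](_._2).reverse
    // takeOrdered returns a local Array, so zipWithIndex here runs on the driver.
    scores.takeOrdered(n)(byScoreDesc).zipWithIndex
  }
}
```

Only n elements are returned to the driver, so this is a reasonable fit when n is small compared to the size of the RDD.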

Otherwise, you can use the zipWithIndex transformation, which turns your RDD[(VertexId, Double)] into an RDD[((VertexId, Double), Long)] with the correct index (of course, you should do this after sorting).

For example:

scala> val data = sc.parallelize(List(("A", 1), ("B", 2)))
scala> val sorted = data.sortBy(_._2)
scala> sorted.zipWithIndex.collect()
res1: Array[((String, Int), Long)] = Array(((A,1),0), ((B,2),1))
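Applied to the case in the question, the same idea might look like the sketch below. It assumes Long keys (VertexId is an alias for Long in GraphX) and a descending sort so that rank 0 is the highest score; RankByScore, ranked and rankOf are hypothetical names used only for illustration:

```scala
import org.apache.spark.rdd.RDD

object RankByScore {
  // Pair every (id, score) with its 0-based rank, highest score first.
  // The result is still an RDD, so nothing is collected to the driver here.
  def ranked(scores: RDD[(Long, Double)]): RDD[((Long, Double), Long)] =
    scores.sortBy(_._2, ascending = false).zipWithIndex()

  // Look up the rank of one id with filter; only the matching row is collected.
  def rankOf(rankedScores: RDD[((Long, Double), Long)], id: Long): Option[Long] =
    rankedScores
      .filter { case ((vid, _), _) => vid == id }
      .map { case (_, rank) => rank }
      .collect()
      .headOption
}
```

Note that zipWithIndex needs to run a Spark job to count the elements per partition when the RDD has more than one partition, so it may be worth caching the ranked RDD if you plan to query it several times.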

Regards

Spark: sort an RDD and attach each element's rank