Question

I have an RDD[(String, (Int, Int))] and, after sorting, I need the top 10 values (tuples) for each key. I tried:

val sortedRDD = rdd.groupByKey.mapValues( x => x.toList.sortWith((x,y) => <<sorting logic>>).take(10))

This throws an OutOfMemoryException, because there are very few keys and so the Iterable[(Int, Int)] produced by .groupByKey() is very large for some of them. How should I handle this? Is there a way to do this without using groupByKey?

Answer 1

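You should use aggregateByKey instead of groupByKey, so that the sorting and "trimming" (keeping only the top 10) happen while grouping, rather than grouping into potentially huge groups and only then mapping the result.

Here is how that could look: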

// your sorting logic:
val sortingFunction: ((Int, Int), (Int, Int)) => Boolean = ???

val N = 10

val sortedRDD = rdd.aggregateByKey(List[(Int, Int)]())(
  // first function: seqOp, how to add another item of the group to the result
  {
    case (topSoFar, candidate) if topSoFar.size < N => candidate :: topSoFar
    case (topTen, candidate) => (candidate :: topTen).sortWith(sortingFunction).take(N)
  },
  // second function: combOp, how to combine two partial results created by seqOp
  { case (list1, list2) => (list1 ++ list2).sortWith(sortingFunction).take(N) }
)

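Note that for each group, we only ever build values of 10 items or fewer.

Note: performance can be improved by doing fewer sort operations (as written, the same list is re-sorted every time another item or list is added). To address this, you could consider using a "sorted set" with bounded capacity (see Limited SortedSet) as the value, so that each addition efficiently either inserts or discards the new value without re-sorting.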
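
As a rough sketch of that idea: instead of a sorted set (which would silently drop duplicate tuples), the version below keeps each partial result as an already-sorted list and inserts new items in place, so nothing is ever fully re-sorted. The names TopN, before, insert and merge are made up for illustration, and the ordering in before is only a placeholder for your own sorting logic.

```scala
object TopN {
  type Pair = (Int, Int)

  val N = 10

  // Placeholder ordering (an assumption, replace with your own logic):
  // larger first element comes first, ties broken by larger second element.
  val before: (Pair, Pair) => Boolean = (a, b) =>
    a._1 > b._1 || (a._1 == b._1 && a._2 > b._2)

  // seqOp: insert one candidate into an already-sorted list, keep at most N items.
  def insert(sorted: List[Pair], candidate: Pair): List[Pair] = {
    // Everything that should stay ahead of the candidate goes into `ahead`.
    val (ahead, behind) = sorted.span(x => !before(candidate, x))
    (ahead ::: candidate :: behind).take(N)
  }

  // combOp: fold one sorted partial result into the other, item by item.
  def merge(a: List[Pair], b: List[Pair]): List[Pair] =
    b.foldLeft(a)(insert)
}
```

With these two functions, the aggregation above could be written as rdd.aggregateByKey(List.empty[(Int, Int)])(TopN.insert, TopN.merge). Each insertion is linear in N, which is cheap for N = 10; for a much larger N a heap-based structure would probably be a better fit.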
