Question

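I have elements of type (Int, (Int, Int, Int)) in an RDD. The goal is to limit the number of elements that share the same key to some threshold t. The simple solution is the following: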

rdd.groupByKey().flatMapValues{iterable => {
  iterable.take(t)
}}

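I thought it would be useful to replace this code with combineByKey, so that map-side aggregation is applied through the combiner, since more than t elements with the same key may occur within a single partition. That led to the following: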

val function_createCombiner = (x: (Int, Int, Int)) => {
  ArrayBuffer[(Int, Int, Int)](x)
}

val function_mergeValue = (accumulator: ArrayBuffer[(Int, Int, Int)], x: (Int, Int, Int)) => {
  if (accumulator.size < t) {
    accumulator += x
  }
  accumulator
}

val function_mergeCombiners = (accumulator1: ArrayBuffer[(Int, Int, Int)], accumulator2: ArrayBuffer[(Int, Int, Int)]) => {
  val iter = accumulator2.iterator
  var saturated = false
  while (!saturated && iter.hasNext) {
    if (accumulator1.length < t) {
      accumulator1 += iter.next()
    } else {
      saturated = true
    }
  }
  accumulator1
}

rdd
  .combineByKey(function_createCombiner, function_mergeValue, function_mergeCombiners)
  .flatMapValues(x => x.toList)

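Surprisingly, the combineByKey solution performs much worse than the groupByKey solution. With combineByKey, the GC is active for 50% of the time, so I assume I am creating a lot of temporary buffers. On the other hand, the internet is full of advice that groupByKey should be avoided whenever possible.

combineByKey time: 11 minutes

groupByKey time: 4.1 minutes

Is there some terrible flaw in my combineByKey solution? Or am I missing something else?

Thanks in advance!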

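Edit: This question is actually a duplicate, sorry. The cause is that only a very small number of elements occur more than t times. So it is clear that I (almost) tried to reimplement groupByKey with combineByKey. My only options are to use groupByKey, which seems to be faster, or to omit the step entirely if possible. Anyway, thanks for your help!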
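To illustrate the point made in the edit, here is a minimal local sketch in plain Scala (no Spark involved). It simulates only the map-side step on hand-made "partitions" and counts how many values would still have to be shuffled. The names `CapDemo`, `combinePartition`, and `shuffledValues` are hypothetical and exist only for this illustration; the real cost in Spark also depends on serialization and buffer allocation, which this sketch does not model.

```scala
// Hypothetical local simulation of a capped map-side combine; not Spark code.
object CapDemo {
  type V = (Int, Int, Int)

  // Map-side step: within one partition, keep at most t values per key.
  def combinePartition(part: Seq[(Int, V)], t: Int): Map[Int, Vector[V]] =
    part.foldLeft(Map.empty[Int, Vector[V]]) { case (acc, (k, v)) =>
      val buf = acc.getOrElse(k, Vector.empty[V])
      if (buf.size < t) acc.updated(k, buf :+ v) else acc
    }

  // Number of values that would still be sent over the network
  // after every partition has been combined locally.
  def shuffledValues(parts: Seq[Seq[(Int, V)]], t: Int): Int =
    parts.map(p => combinePartition(p, t).values.map(_.size).sum).sum

  def main(args: Array[String]): Unit = {
    val t = 2
    // Skewed data: one key occurs 10 times in a partition, so the cap drops 8 values.
    val skewed = Seq(Seq.fill(10)((1, (0, 0, 0))))
    // Sparse data: every key occurs once, so the cap drops nothing.
    val sparse = Seq((1 to 10).map(k => (k, (0, 0, 0))))
    println(shuffledValues(skewed, t)) // prints 2
    println(shuffledValues(sparse, t)) // prints 10
  }
}
```

When almost no key exceeds t, as in the sparse case, the combiner still builds one buffer per key but removes no data before the shuffle, which is consistent with the extra GC pressure described in the question.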

Answer 1

Personally, I would go with reduceByKey when working with RDDs :)

rdd
 .mapValues(List(_))
 .reduceByKey((v1, v2) => (v1++v2).take(t))
 .flatMapValues(identity(_))

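I find it much easier than combineByKey, and it is usually more efficient than groupByKey, because it performs a map-side reduce before shuffling the data. I say usually because there are cases (for example, when you collect all the values for each key) where groupByKey performs just as well as reduceByKey.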
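The merge function used with reduceByKey above can be tried out on a local collection. The following is a minimal sketch in plain Scala, not Spark code; `ReduceCapDemo` and `capByKey` are hypothetical names introduced here, and the local `groupBy` merely stands in for Spark's grouping of values by key.

```scala
// Hypothetical local stand-in for mapValues(List(_)).reduceByKey(...); not Spark code.
object ReduceCapDemo {
  // Wrap every value in a single-element list, then merge lists pairwise,
  // truncating to at most t elements after each merge.
  def capByKey[K, A](pairs: Seq[(K, A)], t: Int): Map[K, List[A]] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      val merged = kvs
        .map(kv => List(kv._2))
        .reduce((v1, v2) => (v1 ++ v2).take(t))
      (k, merged)
    }

  def main(args: Array[String]): Unit = {
    val pairs = Seq((1, "a"), (1, "b"), (1, "c"), (2, "d"))
    val capped = capByKey(pairs, 2)
    println(capped(1).size) // prints 2
    println(capped(2).size) // prints 1
  }
}
```

Each merge allocates a new list, so this approach also creates short-lived objects. In a distributed run the order in which values are merged is not guaranteed, so which t values survive for a key should be treated as arbitrary.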

groupByKey is faster than combineByKey
