Question

I have run into a situation. When I use

val a = rdd.pipe("./my_cpp_program").persist() 
a.count()  // just use it to persist a 
val b = a.map(s => (s, 1)).reduceByKey().count()

it runs very fast.

But when I use

val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey().count()

it is far too slow, and my executors log many entries like this:

15/10/31 19:53:58 INFO collection.ExternalSorter: Thread 78 spilling in-memory map of 633.1 MB to disk (8 times so far) 
15/10/31 19:54:14 INFO collection.ExternalSorter: Thread 74 spilling in-memory map of 633.1 MB to disk (8 times so far) 
15/10/31 19:54:17 INFO collection.ExternalSorter: Thread 79 spilling in-memory map of 633.1 MB to disk (8 times so far) 
15/10/31 19:54:29 INFO collection.ExternalSorter: Thread 77 spilling in-memory map of 633.1 MB to disk (8 times so far) 
15/10/31 19:54:50 INFO collection.ExternalSorter: Thread 76 spilling in-memory map of 633.1 MB to disk (9 times so far)

Answer 1

You have not passed a function to reduceByKey(). From the documentation for reduceByKey:

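When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.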

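In this case, you want to pass the anonymous function (a, b) => a + b to aggregate the values for each key (it can also be written as _ + _ using Scala's underscore shorthand).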
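For reference, here is a minimal sketch of the corrected pipeline. It assumes a local SparkContext and uses `cat` as a stand-in for ./my_cpp_program; the object name, helper name, and sample data are made up for illustration and are not from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReduceByKeySketch {
  // Pair each line with 1, then sum the 1s per key using an explicit reduce function.
  def countsPerKey(lines: RDD[String]): RDD[(String, Int)] =
    lines.map(s => (s, 1)).reduceByKey(_ + _) // same as (a, b) => a + b

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, only for trying this out.
    val conf = new SparkConf().setAppName("reduceByKey-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq("x", "y", "x", "z", "x"))
      // Assumption: `cat` echoes its input and stands in for ./my_cpp_program.
      val piped = rdd.pipe("cat")
      val counts = countsPerKey(piped)
      counts.collect().foreach(println) // one (key, count) pair per distinct line
      println(counts.count())           // number of distinct keys
    } finally {
      sc.stop()
    }
  }
}
```

The only change from the code in the question is the function passed to reduceByKey; everything else is scaffolding so the sketch can run on its own.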

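Since you are calling count() (which essentially counts the number of unique keys after reduceByKey()), you can simply use distinct() instead. The implementation of distinct is actually very similar to what you are currently attempting (it maps to (s, null) and then calls reduceByKey), but from a code-readability standpoint, distinct better signals what your end goal is. Something like this would work: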

val b = rdd.pipe("./my_cpp_program").distinct().count()

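Since you may actually also be interested in the count for each unique key, there are other functions in the PairRDDFunctions class that can help with this. I would look at countByKey(), countByKeyApprox(), and countApproxDistinctByKey(). Each has a different use case, but they offer interesting solutions to their respective problems.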
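As a rough illustration of the simplest of these, the sketch below uses countByKey() to get the per-key counts directly. The object name, helper name, and sample data are hypothetical, and `cat` again stands in for ./my_cpp_program. Keep in mind that countByKey() returns its result as a Map on the driver, so it only fits when the number of distinct keys is small.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CountByKeySketch {
  // Count occurrences of each line; the whole result map is loaded into driver memory.
  def exactCounts(lines: RDD[String]): scala.collection.Map[String, Long] =
    lines.map(s => (s, 1)).countByKey()

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, only for trying this out.
    val conf = new SparkConf().setAppName("countByKey-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: `cat` echoes its input and stands in for ./my_cpp_program.
      val piped = sc.parallelize(Seq("x", "y", "x", "z", "x")).pipe("cat")
      val counts = exactCounts(piped)
      counts.foreach { case (key, n) => println(s"$key -> $n") }
      println(counts.size) // number of distinct keys
    } finally {
      sc.stop()
    }
  }
}
```

If the set of keys is too large to hold on the driver, the reduceByKey(_ + _) form keeps the counts distributed as an RDD instead.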

Using pipe() and reduceByKey()
