Question

I have an RDD. I want to group it by some property and save each group to a separate file (and get the list of resulting file names). The most naive way:

import org.apache.spark.rdd.RDD

val rdd: RDD[Long] = ???
// Key each number by its last digit
val byLastDigit: RDD[(Int, Long)] = rdd.map(n => ((n % 10).toInt, n))
val saved: Array[String] = byLastDigit.groupByKey().map((numbers: (Int, Iterable[Long])) => {
  // Save the numbers of this group into a file and return the file name
  ???
}).collect()

The drawback of this approach is that it holds all values for a key in memory at once, so it will not work on huge datasets.

An alternative approach:

import org.apache.spark.HashPartitioner

byLastDigit.partitionBy(new HashPartitioner(1000)).mapPartitions((numbers: Iterator[(Int, Long)]) => {
  // Assume that all numbers in a partition have the same key
  ???
}).collect()

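Because the number of partitions is much higher than the number of keys, each partition will most likely hold the numbers for only one key.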

It handles huge datasets smoothly. But it is ugly and error-prone.
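The error-prone part is the assumption that a partition holds a single key. One way to drop that assumption while still streaming is to keep one open writer per key seen in the partition, so only the writers are held in memory and never the values. The following is a minimal sketch, not a definitive implementation: `PartitionWriter`, `writeByKey`, the file naming scheme, and the use of the local filesystem via `java.nio` are all assumptions made for illustration (on a real cluster the output would more likely go through the Hadoop `FileSystem` API).

```scala
import java.io.BufferedWriter
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.collection.mutable

// Hypothetical helper, not part of Spark.
object PartitionWriter {
  // Streams one partition's records into one file per key.
  // Memory use is bounded by the number of distinct keys in the
  // partition (one open writer each), not by the number of values.
  def writeByKey(records: Iterator[(Int, Long)],
                 outDir: Path,
                 partitionId: Int): Iterator[String] = {
    Files.createDirectories(outDir)
    val writers = mutable.LinkedHashMap.empty[Int, (Path, BufferedWriter)]
    try {
      records.foreach { case (key, n) =>
        // Open the file for this key lazily, on first occurrence.
        val (_, writer) = writers.getOrElseUpdate(key, {
          val path = outDir.resolve(s"key-$key-part-$partitionId.txt")
          (path, Files.newBufferedWriter(path, StandardCharsets.UTF_8))
        })
        writer.write(n.toString)
        writer.newLine()
      }
    } finally {
      // Always flush and close every writer, even if a write failed.
      writers.values.foreach { case (_, writer) => writer.close() }
    }
    // Return the names of the files written by this partition.
    writers.values.map { case (path, _) => path.toString }.toList.iterator
  }
}
```

With Spark this could be plugged in as `byLastDigit.partitionBy(new HashPartitioner(1000)).mapPartitionsWithIndex((i, it) => PartitionWriter.writeByKey(it, java.nio.file.Paths.get("/tmp/out"), i)).collect()`, where `/tmp/out` is a placeholder path. Since `partitionBy` sends all records with the same key to the same partition, each key still ends up in exactly one file, and a partition that happens to receive several keys simply writes several files.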

Can it be done better?

Spark: groupByKey with "Iterator" instead of "Iterable" on the right side

0 answers: