Question

I have an RDD[LabeledPoint].

I want to transform it into an RDD[Array[LabeledPoint]] in which all arrays have approximately the same size (with at most one smaller array, if needed).

I found a method here (for an RDD[Double]) that iterates over the partitions of the RDD:

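// batchedDegree is the desired number of elements per array.
val batchedRDD = rdd.mapPartitions { iter: Iterator[Int] =>
  // Wrap the partition's iterator so that each call to next() yields one batch.
  new Iterator[Array[Int]] {
    def hasNext: Boolean = iter.hasNext
    def next(): Array[Int] = {
      // Takes up to batchedDegree elements; the last batch of a partition may be shorter.
      iter.take(batchedDegree).toArray
    }
  }
}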

In practice, however, because this method works partition by partition, it creates many arrays that are much smaller than the desired size.
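To make the effect concrete, here is a minimal sketch in plain Scala, with no Spark involved. The nested `Seq` is a hypothetical stand-in for an RDD's partitions, and `batchPartition`, the partition contents, and the batch size of 3 are all assumptions chosen for illustration. It uses `Iterator.grouped`, which batches the same way as the snippet above: full batches, plus at most one shorter batch per partition.

```scala
object PartitionBatchingSketch {
  // Batch one partition's elements: full batches of `batchSize`,
  // plus at most one shorter trailing batch.
  def batchPartition(partition: Seq[Int], batchSize: Int): List[Array[Int]] =
    partition.iterator.grouped(batchSize).map(_.toArray).toList

  // Hypothetical stand-in for an RDD: each inner Seq plays the role of one partition.
  val partitions: Seq[Seq[Int]] = Seq(1 to 7, 8 to 14, 15 to 20)

  val batchSize = 3

  // Batching is applied to each partition independently, as mapPartitions would do.
  val batches: Seq[Array[Int]] = partitions.flatMap(p => batchPartition(p, batchSize))

  def main(args: Array[String]): Unit =
    batches.foreach(b => println(b.mkString("[", ", ", "]")))
}
```

Each partition whose size is not a multiple of the batch size contributes its own undersized array, so the number of short arrays grows with the number of partitions rather than staying at one.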

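I have considered using coalesce to reduce the number of partitions, and thus the number of undersized arrays. But this might reduce the speed-up in later stages of my job.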

Do you have another idea for transforming the RDD in a better way?

Answer 1

You can use rdd.glom().

From the Scala docs:

/**
  * Return an RDD created by coalescing all elements within each partition into an array.
  */
 def glom(): RDD[Array[T]] = withScope
 {
     new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))
 }
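Since glom() produces exactly one array per partition, the array sizes follow the partition sizes. The following is a rough sketch, not part of the original answer, of one way to steer those sizes: repartition the RDD into about count / batchSize partitions before calling glom(). The helper name `toBatches`, the local master, and the sample data are assumptions for illustration. `repartition` triggers a shuffle and spreads elements only approximately evenly, so the resulting arrays are close to, but not exactly, the requested size.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object GlomSketch {
  // Hypothetical helper: one array per partition, with the partition count
  // chosen so that each array holds roughly `batchSize` elements.
  // Costs one extra job (count) and one shuffle (repartition).
  def toBatches[T](rdd: RDD[T], batchSize: Int): RDD[Array[T]] = {
    val numBatches = math.max(1, math.ceil(rdd.count().toDouble / batchSize).toInt)
    rdd.repartition(numBatches).glom()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("glom-sketch")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 20, numSlices = 4)

    // Print the size of each resulting array.
    toBatches(rdd, batchSize = 6).collect().foreach(b => println(b.length))

    spark.stop()
  }
}
```

Whether the extra shuffle is acceptable depends on the job; if the existing partitioning must be preserved for later stages, plain glom() without repartitioning keeps it intact.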

"Un-flatten" an RDD in Spark
