Question

Given that the following code is well understood:

val rdd = sc.parallelize(List(("A", List(1, 1)), 
                              ("B", List(2, 2, 2, 200)), 
                              ("C", List(3, 3)),
                              ("D", List(2, 2)),
                              ("A", List(1, 1, 1)),
                              ("B", List(1, 1, 1)),
                              ("P", List(1, 1, 1))                             
                         ),3)
rdd.flatMap(_._2).sum

and:

val mapped =   rdd.mapPartitionsWithIndex{
                    (index, iterator) => {
                       println("Called in Partition -> " + index)
                       val myList = iterator.toList
                       // In a normal use case, we would do the
                       // initialization (e.g. opening a database connection)
                       // before iterating through each element
                       myList.map(x => x + " -> " + index).iterator

                    }
                 }
  mapped.collect()

Then, for the sake of argument (perhaps not a great example, but still), how can I apply

rdd.flatMap(_._2).sum

using mapPartitions or mapPartitionsWithIndex instead?

I get an error every time, which I believe is because of the Iterator.

An answer would clear up a few things for me. I suspect this is not possible at all, but I would like to confirm.

Answer 1

I found a clue in another question. It is actually quite simple.

Here it is:

rdd.mapPartitions(it => Iterator(it.flatMap(_._2).sum)).collect()
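The function passed to mapPartitions has to take an Iterator and return an Iterator, which is probably why returning a bare sum fails to compile; wrapping the sum in Iterator(...) fixes that. Note that the result is one sum per partition, not a single total. Below is a minimal plain-Scala sketch of that behaviour, with no Spark required. The object name PartitionSumSketch, the Row alias, and the hand-written split into three partitions are assumptions made for illustration; Spark's actual split of the data may differ.

```scala
object PartitionSumSketch {
  // Hypothetical alias for the element type of the question's RDD.
  type Row = (String, List[Int])

  // Same shape as the closure in the answer: one Iterator in, one Iterator out.
  val sumPartition: Iterator[Row] => Iterator[Int] =
    it => Iterator(it.flatMap(_._2).sum)

  // Assumed split of the question's data into 3 partitions (illustration only).
  val partitions: List[List[Row]] = List(
    List(("A", List(1, 1)), ("B", List(2, 2, 2, 200))),
    List(("C", List(3, 3)), ("D", List(2, 2))),
    List(("A", List(1, 1, 1)), ("B", List(1, 1, 1)), ("P", List(1, 1, 1)))
  )

  // Models mapPartitions(...).collect(): one sum per partition.
  val perPartition: List[Int] =
    partitions.flatMap(p => sumPartition(p.iterator))   // List(208, 10, 9)

  // Models the mapPartitionsWithIndex variant: (partition index, sum) pairs.
  val perPartitionWithIndex: List[(Int, Int)] =
    partitions.zipWithIndex.map { case (p, i) => (i, sumPartition(p.iterator).next()) }

  // Adding up the per-partition sums gives the overall total.
  val total: Int = perPartition.sum                      // 227
}
```

In Spark itself, the same total should be obtainable by summing the collected array, i.e. appending .sum after .collect() in the line above.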
