Spark: partition.toList fails

Date: 2017-06-14 23:45:46

Tags: scala apache-spark

What I want is to group several elements within a partition and then perform some operations on the grouped elements of each partition. But I found that the conversion from partition to list fails. See the following example:

import scala.collection.mutable.ArrayBuffer
val rdd = sc.parallelize(Seq("a","b","c","d","e"), 2)
val mapped = rdd.mapPartitions( partition => {
  val total = partition.size            // consumes the iterator
  val first = partition.toList match {  // iterator is already empty here
    case Nil => "EMPTYLIST"
    case _   => partition.toList.head
  }

  val finalResult = ArrayBuffer[String]()
  finalResult += "1:" + first
  finalResult += "2:" + first
  finalResult += "3:" + first

  finalResult.iterator
})

mapped.collect()

Result:

Array[String] = Array(1:EMPTYLIST, 2:EMPTYLIST, 3:EMPTYLIST, 1:EMPTYLIST, 2:EMPTYLIST, 3:EMPTYLIST)

Why is partition.toList always empty?

1 Answer:

Answer 0 (score: 3)

partition is an Iterator, and taking its size consumes it, so by the time you convert it to a List it is already empty. To traverse a partition more than once, convert it to a List at the beginning and then do whatever you need with that List:
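The underlying behavior can be reproduced without Spark at all, using a plain Scala Iterator (a minimal sketch, not the original Spark code):

```scala
// Calling size on an Iterator traverses and consumes it,
// so a subsequent toList finds nothing left.
object IteratorDemo extends App {
  val it = Iterator("a", "b", "c")
  val total = it.size    // consumes all elements
  val rest  = it.toList  // iterator already exhausted

  println(total)  // 3
  println(rest)   // List()
}
```

This is exactly what happens inside mapPartitions: partition.size exhausts the iterator before partition.toList runs.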

val mapped = rdd.mapPartitions( partition => {
  val partitionList = partition.toList  // materialize once, reuse safely
  val total = partitionList.size
  val first = partitionList match {
    case Nil => "EMPTYLIST"
    case _   => partitionList.head
  }

  val finalResult = ArrayBuffer[String]()
  finalResult += "1:" + first
  finalResult += "2:" + first
  finalResult += "3:" + first

  finalResult.iterator
})

mapped.collect
// res7: Array[String] = Array(1:a, 2:a, 3:a, 1:c, 2:c, 3:c)
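One caveat with toList: it materializes the entire partition in memory. If, as in this example, only the first element is actually needed, a BufferedIterator lets you peek at it without consuming the iterator. A hedged sketch of that alternative, shown with a plain Iterator standing in for a Spark partition:

```scala
// BufferedIterator.head peeks at the next element without
// advancing, so the iterator can still be traversed afterwards.
object PeekDemo extends App {
  val partition = Iterator("a", "b", "c") // stand-in for a Spark partition
  val buffered  = partition.buffered

  val first = if (buffered.hasNext) buffered.head else "EMPTYLIST"
  val all   = buffered.toList  // still yields every element

  println(first)  // a
  println(all)    // List(a, b, c)
}
```

This avoids holding the whole partition as a List when a single peek is enough; when you genuinely need multiple full passes, toList (as in the accepted answer) remains the straightforward choice.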