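I want to group several elements within a partition and then perform some operations on the grouped elements of each partition. However, I found that converting the partition to a list fails. See the following example: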
import scala.collection.mutable.ArrayBuffer
val rdd = sc.parallelize(Seq("a","b","c","d","e"), 2)
val mapped = rdd.mapPartitions( partition =>
{
  val total = partition.size
  var first = partition.toList match
  {
    case Nil => "EMPTYLIST"
    case _ => partition.toList.head
  }
  var finalResult = ArrayBuffer[String]()
  finalResult += "1:"+first;
  finalResult += "2:"+first;
  finalResult += "3:"+first;
  finalResult.iterator
})
mapped.collect()
Result:
Array[String] = Array(1:EMPTYLIST, 2:EMPTYLIST, 3:EMPTYLIST, 1:EMPTYLIST, 2:EMPTYLIST, 3:EMPTYLIST)
Why is partition.toList always empty?
Answer 0 (score: 3)
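partition is an iterator, and computing its size consumes it, so by the time you convert it to a List it is already empty. To traverse the partition more than once, convert it to a list at the start and then do whatever you need on that list: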
val mapped = rdd.mapPartitions( partition =>
{
  // Materialize the iterator once; every later step reads from this list.
  val partitionList = partition.toList
  val total = partitionList.size
  val first = partitionList match
  {
    case Nil => "EMPTYLIST"
    case _ => partitionList.head
  }
  var finalResult = ArrayBuffer[String]()
  finalResult += "1:"+first;
  finalResult += "2:"+first;
  finalResult += "3:"+first;
  finalResult.iterator
})
mapped.collect
// res7: Array[String] = Array(1:a, 2:a, 3:a, 1:c, 2:c, 3:c)
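
The single-pass behaviour can also be reproduced without Spark. The following is only a minimal sketch in plain Scala: the object name IteratorOnce and the helper firstThree are hypothetical names made up for illustration, and the helper simply mirrors the body passed to mapPartitions above.

```scala
// Plain Scala, no Spark required: an Iterator can only be traversed once.
object IteratorOnce {
  // Hypothetical helper mirroring the mapPartitions body in the answer above.
  def firstThree(partition: Iterator[String]): Iterator[String] = {
    val items = partition.toList                         // materialize exactly once
    val first = items.headOption.getOrElse("EMPTYLIST")  // safe on an empty partition
    Iterator("1:" + first, "2:" + first, "3:" + first)
  }

  def main(args: Array[String]): Unit = {
    val it = Iterator("a", "b", "c")
    val total = it.size   // counting walks the iterator to its end
    println(total)        // 3
    println(it.toList)    // List(): nothing is left to read

    println(firstThree(Iterator("a", "b")).toList)  // List(1:a, 2:a, 3:a)
    println(firstThree(Iterator.empty).toList)      // falls back to EMPTYLIST
  }
}
```

Assuming the same rdd as in the question, a helper like this could presumably be passed directly, as in rdd.mapPartitions(IteratorOnce.firstThree). Note that toList holds the whole partition in memory, so this approach suits partitions that fit comfortably on an executor.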