Spark partitioning breaks the lazy-evaluation chain and triggers an error I cannot catch

Asked: 2019-03-22 20:39:44

Tags: scala apache-spark apache-spark-2.0

When I repartition, Spark breaks the lazy-evaluation chain and triggers an error that I cannot control/catch.

//simulation of reading a stream from s3
def readFromS3(partition: Int) : Iterator[(Int, String)] = {
    Iterator.tabulate(3){idx => 
        // simulate an error only on partition 3 record 2
        (idx, if(partition == 3 && idx == 2) throw new RuntimeException("error") else s"elem $idx on partition $partition" )
    }    
}

val rdd = sc.parallelize(Seq(1, 2, 3, 4), 4) // force 4 partitions so the simulated failure on partition 3 is reachable
            .mapPartitionsWithIndex((partitionIndex, iter) => readFromS3(partitionIndex))

// I can do whatever I want here

//this is what triggers the evaluation of the iterator 
val partitionedRdd = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))

// I can do whatever I want here

//desperately trying to catch the exception 
partitionedRdd.foreachPartition{ iter => 
    try{
        iter.foreach(println)
    }catch{
        case _: Throwable => println("error caught")
    }
}

Before you comment, please note:

  1. This is an extreme oversimplification of my actual application
  2. I know the reads from S3 could be done differently, e.g. with sc.textFile; I have no control over that part and cannot change it
  3. I understand where the problem comes from: when partitioning, Spark breaks the lazy-evaluation chain and evaluates the records, which triggers the error. And I have to partition!
  4. I am not saying there is a bug in Spark; Spark needs to evaluate the records it shuffles
  5. The only places where I can add my own code are:
    • between the S3 read and the partitioning
    • after the partitioning
  6. I could write my own custom partitioner

Given the constraints above, can I work around this? Is there a solution?

1 answer:

Answer 0 (score: 0)

The only solution I could find is an EvalAheadIterator (an iterator that evaluates the head of the buffer before iterator.next is called):

import scala.collection.AbstractIterator
import scala.util.control.NonFatal

class EvalAheadIterator[+A](iter : Iterator[A]) extends AbstractIterator[A] {

  private val bufferedIter  : BufferedIterator[A] = iter.buffered

  override def hasNext: Boolean =
    if(bufferedIter.hasNext){
      try{
          bufferedIter.head //evaluate the head and trigger potential exceptions
          true
      }catch{
          case NonFatal(e) =>
            println("caught exception ahead of time")
            false
      }
    }else{
        false
    }


  override def next() : A = bufferedIter.next()
}
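
As a quick local sanity check (outside Spark; the throwing sample iterator below is only an illustrative assumption, not part of the original job), the behaviour is: the failing element gets evaluated inside hasNext, the exception is swallowed there, and iteration simply ends early:

// hypothetical local check, not part of the Spark job
val boom: Iterator[Int] = Iterator.tabulate(5){ idx =>
    if(idx == 3) throw new RuntimeException("boom") else idx
}

new EvalAheadIterator(boom).foreach(println)
// prints 0, 1, 2, then "caught exception ahead of time";
// element 3 and everything after it is dropped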

Now we apply the EvalAheadIterator inside a mapPartitions call:

//simulation of reading a stream from s3
def readFromS3(partition: Int) : Iterator[(Int, String)] = {
    Iterator.tabulate(3){idx => 
        // simulate an error only on partition 3 record 2
        (idx, if(partition == 3 && idx == 2) throw new RuntimeException("error") else s"elem $idx on partition $partition" )
    }    
}

val rdd = sc.parallelize(Seq(1, 2, 3, 4), 4) // force 4 partitions so the simulated failure on partition 3 is reachable
            .mapPartitionsWithIndex((partitionIndex, iter) => readFromS3(partitionIndex))
            .mapPartitions{iter => new EvalAheadIterator(iter)}

// I can do whatever I want here

//this is what triggers the evaluation of the iterator 
val partitionedRdd = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))

// I can do whatever I want here

//desperately trying to catch the exception 
partitionedRdd.foreachPartition{ iter => 
    try{
        iter.foreach(println)
    }catch{
        case _: Throwable => println("error caught")
    }
}
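
Note the trade-off: because hasNext swallows the exception, the partitionBy shuffle no longer fails, but the failing record (and anything after it in that partition) is silently dropped; in a real job you would replace the println with proper logging.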