Spark partitioning breaks the lazy-evaluation chain and triggers an error I cannot catch

Asked: 2019-03-22 20:39:44

Tags: scala apache-spark apache-spark-2.0

When I repartition, Spark breaks the lazy-evaluation chain and triggers an error that I cannot control/catch.

//simulation of reading a stream from s3
def readFromS3(partition: Int) : Iterator[(Int, String)] = {
    Iterator.tabulate(3){idx => 
        // simulate an error only on partition 3 record 2
        (idx, if(partition == 3 && idx == 2) throw new RuntimeException("error") else s"elem $idx on partition $partition" )
    }    
}

val rdd = sc.parallelize(Seq(1, 2, 3, 4), 4) // force 4 partitions so the simulated failure on partition 3 is reachable
            .mapPartitionsWithIndex((partitionIndex, iter) => readFromS3(partitionIndex))

// I can do whatever I want here

//this is what triggers the evaluation of the iterator 
val partitionedRdd = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))

// I can do whatever I want here

//desperately trying to catch the exception 
partitionedRdd.foreachPartition{ iter => 
    try{
        iter.foreach(println)
    }catch{
        case _: Throwable => println("error caught")
    }
}

Before you comment, please note:

  1. This is an extreme oversimplification of my actual application
  2. I know the reads from S3 could be done differently, e.g. with sc.textFile; I have no control over that part and cannot change it
  3. I understand where the problem comes from: when partitioning, Spark breaks the lazy-evaluation chain and evaluates the records, which triggers the error. And I have to partition!
  4. I am not saying there is a bug in Spark; Spark needs to evaluate the records it shuffles
  5. The only places where I can add my own code are:
    • between the S3 read and the partitioning
    • after the partitioning
  6. I could write my own custom partitioner

Given the constraints above, can I work around this? Is there a solution?

1 answer:

Answer 0 (score: 0)

The only solution I could find is an EvalAheadIterator (an iterator that evaluates the head of the buffer before iterator.next is called):

import scala.collection.AbstractIterator
import scala.util.control.NonFatal

class EvalAheadIterator[+A](iter : Iterator[A]) extends AbstractIterator[A] {

  private val bufferedIter  : BufferedIterator[A] = iter.buffered

  override def hasNext: Boolean =
    if(bufferedIter.hasNext){
      try{
          bufferedIter.head //evaluate the head and trigger potential exceptions
          true
      }catch{
          case NonFatal(e) =>
            println("caught exception ahead of time")
            false
      }
    }else{
        false
    }


  override def next() : A = bufferedIter.next()
}
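
As a quick local sanity check (outside Spark; the throwing sample iterator below is only an illustrative assumption, not part of the original job), the behaviour is: the failing element gets evaluated inside hasNext, the exception is swallowed there, and iteration simply ends early:

// hypothetical local check, not part of the Spark job
val boom: Iterator[Int] = Iterator.tabulate(5){ idx =>
    if(idx == 3) throw new RuntimeException("boom") else idx
}

new EvalAheadIterator(boom).foreach(println)
// prints 0, 1, 2, then "caught exception ahead of time";
// element 3 and everything after it is dropped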

Now we apply the EvalAheadIterator inside a mapPartitions call:

//simulation of reading a stream from s3
def readFromS3(partition: Int) : Iterator[(Int, String)] = {
    Iterator.tabulate(3){idx => 
        // simulate an error only on partition 3 record 2
        (idx, if(partition == 3 && idx == 2) throw new RuntimeException("error") else s"elem $idx on partition $partition" )
    }    
}

val rdd = sc.parallelize(Seq(1, 2, 3, 4), 4) // force 4 partitions so the simulated failure on partition 3 is reachable
            .mapPartitionsWithIndex((partitionIndex, iter) => readFromS3(partitionIndex))
            .mapPartitions{iter => new EvalAheadIterator(iter)}

// I can do whatever I want here

//this is what triggers the evaluation of the iterator 
val partitionedRdd = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))

// I can do whatever I want here

//desperately trying to catch the exception 
partitionedRdd.foreachPartition{ iter => 
    try{
        iter.foreach(println)
    }catch{
        case _: Throwable => println("error caught")
    }
}
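
Note the trade-off: because hasNext swallows the exception, the partitionBy shuffle no longer fails, but the failing record (and anything after it in that partition) is silently dropped; in a real job you would replace the println with proper logging.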