Scala code pattern for loading an RDD, or catching the error and creating it?

Date: 2014-08-14 01:49:30

Tags: scala apache-spark

I want to load an RDD and, if that fails, create it instead. I thought the code below would work, but it still fails even though sc.textFile() is inside the try block. What am I missing, or how do I do this correctly? Thanks!

// look for my RDD, load or make it 
val rdddump = "hdfs://localhost/Users/data/hdfs/namenode/myRDD.txt"
val myRdd = try {
  sc.textFile(rdddump)
} catch {
  case _ : Throwable => {
    println("failed to load RDD from HDFS")
    val newRdd = [....code to make new RDD here...]
    newRdd.saveAsTextFile(rdddump)
    newRdd
  }
}

println(myRdd)
println("RDD count = " + myRdd.count)

And the error looks like this:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost/Users/data/hdfs/namenode/myRDD.txt
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1097)
at org.apache.spark.rdd.RDD.count(RDD.scala:861)
...

1 Answer:

Answer 0 (score: 4)

You are catching the exception in the wrong place, as the stack trace clearly shows. Calling sc.textFile does nothing except declare the relationship between an RDD and its input; it does not trigger any computation, so nothing checks at that point whether the input exists. The exception is only thrown later, when an action forces evaluation — here the count call, which sits outside your try block.
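One way to restructure this (a sketch, not the only approach): check whether the path exists up front using the Hadoop FileSystem API, so the load-or-build decision is made eagerly instead of relying on an exception from a lazy transformation. The buildRdd helper below is a hypothetical stand-in for the construction code elided in the question.

// assumes a spark-shell / driver context where sc is already defined
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD

val rdddump = "hdfs://localhost/Users/data/hdfs/namenode/myRDD.txt"

// hypothetical stand-in for the real RDD construction logic
def buildRdd(): RDD[String] = sc.parallelize(Seq("line1", "line2"))

// resolve the FileSystem for the path's scheme (hdfs:// here) and
// check existence eagerly, before any lazy Spark call is involved
val path = new Path(rdddump)
val fs = path.getFileSystem(sc.hadoopConfiguration)

val myRdd = if (fs.exists(path)) {
  sc.textFile(rdddump)
} else {
  println("RDD not found in HDFS, building and saving it")
  val newRdd = buildRdd()
  newRdd.saveAsTextFile(rdddump)
  newRdd
}

println("RDD count = " + myRdd.count)

Alternatively, you can keep the try/catch but force evaluation inside it, for example by calling an action such as first() on the freshly loaded RDD. Then a missing input path fails inside the try block, where your catch can handle it, rather than at the later count.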