Spark Streaming job cannot recover from checkpoint

Asked: 2017-06-23 08:45:08

Tags: apache-spark spark-streaming

I have a Spark Streaming job that uses mapWithState with an initial RDD. When restarting the application and recovering from the checkpoint, it fails with the error:

This RDD lacks a SparkContext. It could happen in the following cases:

  1. RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
  2. When a Spark Streaming job recovers from a checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, see SPARK-13758.

This behavior is described in https://issues.apache.org/jira/browse/SPARK-13758, but it doesn't really describe how to work around it. My RDD is not defined by the streaming job, but I still need it in the state.
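
For context, here is a minimal, purely illustrative sketch of the pattern case (1) forbids and the usual alternative of computing the nested value on the driver first; the names and values are hypothetical and not part of my job:

    import org.apache.spark.{SparkConf, SparkContext}

    object NestedRddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("nested-rdd-sketch").setMaster("local[*]"))
        val rdd1 = sc.parallelize(1 to 10)
        val rdd2 = sc.parallelize(Seq("a" -> 1L, "b" -> 2L))

        // Invalid (SPARK-5063): rdd2's count() would have to run inside rdd1.map on the executors.
        // val broken = rdd1.map(x => rdd2.values.count() * x)

        // Valid: run the action once on the driver, then use the plain value in the closure.
        val total = rdd2.values.count()
        rdd1.map(x => total * x).collect().foreach(println)

        sc.stop()
      }
    }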

This is an example of what my graph looks like:

    class EventStreamingApplication {
      private val config: Config = ConfigFactory.load()
      private val sc: SparkContext = {
        val conf = new SparkConf()
          .setAppName(config.getString("streaming.appName"))
          .set("spark.cassandra.connection.host", config.getString("streaming.cassandra.host"))
        val sparkContext = new SparkContext(conf)
        System.setProperty("com.amazonaws.services.s3.enableV4", "true")
        sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
        sparkContext
      }
    
      def run(): Unit = {
        // streaming.eventCheckpointDir is an S3 Bucket
        val ssc: StreamingContext = StreamingContext.getOrCreate(config.getString("streaming.eventCheckpointDir"), createStreamingContext)
        ssc.start()
        ssc.awaitTermination()
      }
    
      def receiver(ssc: StreamingContext): DStream[Event] = {
        RabbitMQUtils.createStream(ssc, Map(
          "hosts" -> config.getString("streaming.rabbitmq.host"),
          "virtualHost" -> config.getString("streaming.rabbitmq.virtualHost"),
          "userName" -> config.getString("streaming.rabbitmq.user"),
          "password" -> config.getString("streaming.rabbitmq.password"),
          "exchangeName" -> config.getString("streaming.rabbitmq.eventExchange"),
          "exchangeType" -> config.getString("streaming.rabbitmq.eventExchangeType"),
          "queueName" -> config.getString("streaming.rabbitmq.eventQueue")
        )).flatMap(EventParser.apply)
      }
    
      def setupStreams(ssc: StreamingContext): Unit = {
        val events = receiver(ssc)
        ExampleJob(events, sc)
      }
    
      private def createStreamingContext(): StreamingContext = {
        val ssc = new StreamingContext(sc, Seconds(config.getInt("streaming.batchSeconds")))
        setupStreams(ssc)
        ssc.checkpoint(config.getString("streaming.eventCheckpointDir"))
        ssc
      }
    }
    
    case class Aggregation(value: Long) // Contains aggregation values
    
    object ExampleJob {
      def apply(events: DStream[Event], sc: SparkContext): Unit = {
        val aggregations: RDD[(String, Aggregation)] = sc.cassandraTable("...", "...").map(...) // some domain class mapping
        val state = StateSpec
          .function((key: String, value: Option[Long], state: State[Aggregation]) => {
            val oldValue = state.getOption().map(_.value).getOrElse(0L)
            val newValue = oldValue + value.getOrElse(0L)
            state.update(Aggregation(newValue))
            state.get
          })
          .initialState(aggregations)
          .numPartitions(1)
          .timeout(Seconds(86400))
        events
          .filter(...) // filter out unnecessary events
          .map(...) // domain class mapping to key, event dstream
          .groupByKey()
          .map(i => (i._1, i._2.size.toLong))
          .mapWithState(state)
          .stateSnapshots()
          .foreachRDD(rdd => {
            rdd.saveToCassandra(...)
          })
      }
    }
    

The stack trace that is thrown:

    Exception in thread "main" org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases: 
    (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
    (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
      at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:89)
      at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
      at org.apache.spark.rdd.PairRDDFunctions.partitionBy(PairRDDFunctions.scala:534)
      at org.apache.spark.streaming.rdd.MapWithStateRDD$.createFromPairRDD(MapWithStateRDD.scala:193)
      at org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:146)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
      at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
      at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
      at scala.Option.orElse(Option.scala:289)
      at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
      at org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
      at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
      at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
      at scala.Option.orElse(Option.scala:289)
      ...
      <991 lines omitted>
      ...
      at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
      at org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
      at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
      at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
      at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
      at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:577)
      at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:571)
      at com.example.spark.EventStreamingApplication.run(EventStreamingApplication.scala:31)
      at com.example.spark.EventStreamingApplication$.main(EventStreamingApplication.scala:63)
      at com.example.spark.EventStreamingApplication.main(EventStreamingApplication.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:497)
      at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
      at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
      at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
      at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
      at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    

1 Answer:

Answer 0 (score: 0)

It seems that when Spark tries to recover, it does not pick up the correct, most recent checkpoint file, and because of that it ends up referencing the wrong RDD.

Spark 2.1.1 appears to be affected, since it is not in the list of fixed versions.

See the following Apache JIRA issue, which does not yet have a fix version assigned:

https://issues.apache.org/jira/browse/SPARK-19280

In my opinion, you could try exploring automatic/manual workarounds in which you specify the latest checkpoint file when restarting the Spark job.
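
To make the manual route a bit more concrete, below is a rough, untested sketch using the Hadoop FileSystem API; the "checkpoint" file-name prefix, the path handling, and the deletion policy are assumptions for illustration, not a verified fix. The idea is to inspect the checkpoint directory before restarting and, if needed, remove every checkpoint file except the one you want recovery to pick up:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object CheckpointInspector {
      // List checkpoint files in the directory, oldest first.
      def listCheckpointFiles(checkpointDir: String): Seq[Path] = {
        val fs = FileSystem.get(new URI(checkpointDir), new Configuration())
        fs.listStatus(new Path(checkpointDir))
          .filter(s => s.isFile && s.getPath.getName.startsWith("checkpoint"))
          .sortBy(_.getModificationTime)
          .map(_.getPath)
          .toSeq
      }

      // Delete every checkpoint file except the one you want StreamingContext.getOrCreate to use.
      def deleteAllBut(checkpointDir: String, keep: Path): Unit = {
        val fs = FileSystem.get(new URI(checkpointDir), new Configuration())
        listCheckpointFiles(checkpointDir)
          .filterNot(_ == keep)
          .foreach(p => fs.delete(p, false)) // false = non-recursive delete
      }
    }

For an S3 checkpoint directory you would point this at the bucket URI (for example an s3a:// path) with the appropriate Hadoop/AWS configuration on the classpath. Deleting checkpoint files is destructive, so copying them aside first would be safer.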

I know this is not of much help, but I thought it was better to first explain the root cause of the problem, the current state of the work to fix it, and my view on possible solutions.