Spark Streaming - restarting from a checkpoint

Time: 2017-04-03 19:00:39

Tags: apache-spark spark-streaming

We are building a fault-tolerant system that reads from Kafka and writes to HBase and HDFS. Batches run every 5 seconds. This is the scenario we want to set up:

  1. Start a new Spark Streaming job with checkpointing enabled; it reads from Kafka, processes the data, and stores it to HDFS and HBase

  2. Kill the Spark Streaming job while messages continue to flow into Kafka

  3. Restart the Spark Streaming job. This is what we actually want to happen: Spark Streaming reads the checkpoint data and restarts from the correct Kafka offsets, so no Kafka messages are skipped even though the job was killed and restarted

  4. This does not seem to work: the Spark Streaming job will not start (stack trace of the error is pasted below). The only way I can resubmit the job is to delete the checkpoint directory. That of course means all checkpoint information is lost, and the job starts over reading only new Kafka messages.

    Should this work? If so, is there anything specific I need to do to make it work?

    Here is the sample code:

    1) I am on Spark 1.6.2. Here is how I create the streaming context:

    val ddqSsc = StreamingContext.getOrCreate(checkpointDir, () =>
      createDDQStreamingContext(slideInterval.toLong, inputKafka, outputKafka,
        hbaseVerTableName, checkpointDir, baseRawHdfs, securityProtocol, groupID,
        zooKeeper, kafkaBrokers, hiveDBToLoad, hiveTableToLoad))
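
    For what it's worth, StreamingContext.getOrCreate only invokes the factory function when nothing can be restored from checkpointDir; on a restart it deserializes the entire DStream graph (including the direct stream's Kafka offsets) from the checkpoint files and never calls the factory. A minimal sketch of the driver lifecycle around this call, using the same parameters as above (the log line is added here purely for illustration):

        val ddqSsc = StreamingContext.getOrCreate(checkpointDir, () => {
          // Runs only on a cold start; after a restart the context is rebuilt
          // from the checkpoint data instead and this block is skipped.
          println(s"No checkpoint found in $checkpointDir, creating a new StreamingContext")
          createDDQStreamingContext(slideInterval.toLong, inputKafka, outputKafka,
            hbaseVerTableName, checkpointDir, baseRawHdfs, securityProtocol, groupID,
            zooKeeper, kafkaBrokers, hiveDBToLoad, hiveTableToLoad)
        })
        ddqSsc.start()             // must also be called on a restored context
        ddqSsc.awaitTermination()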
    

    2) Here is the initial part of the function that getOrCreate calls:

    def createDDQStreamingContext(slideInterval: Long, inputKafka: String, outputKafka: String,
        hbaseVerTableName: String, checkpointDir: String, baseRawHdfs: String,
        securityProtocol: String, groupID: String, zooKeeper: String, kafkaBrokers: String,
        hiveDBToLoad: String, hiveTableToLoad: String): StreamingContext = {

        val sparkConf = new SparkConf()
        val ssc = new StreamingContext(sparkConf, Seconds(slideInterval))

        //val sqlContext = new SQLContext(sc)
        val sqlContext = new HiveContext(ssc.sparkContext)
        import sqlContext.implicits._

        ssc.checkpoint(checkpointDir)
        val kafkaTopics = Set(inputKafka)

        //Kafka parameters
        var kafkaParams = Map[String, String]()
        kafkaParams += ("bootstrap.servers" -> kafkaBrokers)
        kafkaParams += ("zookeeper.connect" -> zooKeeper)

        //Need this in a kerberos environment
        kafkaParams += ("security.protocol" -> securityProtocol)
        kafkaParams += ("sasl.kerberos.service.name" -> "kafka")
        //WHAT IS THIS!!??
        kafkaParams += ("group.id" -> groupID)

        kafkaParams += ("key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer")
        kafkaParams += ("value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer")

        val inputDataDstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, kafkaTopics)
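
    Incidentally, one way to verify step 3 above (that no Kafka messages are skipped after a restart) is to log the offset ranges of each batch. A minimal sketch using HasOffsetRanges from the Spark 1.6 Kafka integration; trackedDstream is a name introduced here for illustration, and the cast only works when transform is the first operation applied to the direct stream:

        import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

        val trackedDstream = inputDataDstream.transform { rdd =>
          // Each RDD from the direct stream carries its Kafka offset ranges;
          // logging them shows exactly where processing resumes after a restart.
          val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          ranges.foreach(r =>
            println(s"${r.topic} partition ${r.partition}: offsets ${r.fromOffset} to ${r.untilOffset}"))
          rdd
        }

    Downstream processing would then use trackedDstream in place of inputDataDstream.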
    

    ================== Stack trace ====================

      

    2017-04-03 11:27:27,047 ERROR [Driver] yarn.ApplicationMaster: User class threw exception: java.lang.NullPointerException
    java.lang.NullPointerException
            at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
            at org.apache.spark.sql.SQLConf.dataFrameEagerAnalysis(SQLConf.scala:573)
            at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
            at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
            at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:417)
            at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)
            at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor$$anonfun$createDDQStreamingContext$1.apply(ddqKafkaDataProcessor.scala:97)
            at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor$$anonfun$createDDQStreamingContext$1.apply(ddqKafkaDataProcessor.scala:73)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:700)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:700)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:714)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:712)
            at org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:46)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
            at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
            at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
            at org.apache.spark.streaming.dstream.TransformedDStream.createRDDWithLocalProperties(TransformedDStream.scala:65)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
            at scala.Option.orElse(Option.scala:257)
            at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
            at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
            at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
            at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
            at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
            at scala.Option.orElse(Option.scala:257)
            at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
            at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:47)
            at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
            at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:114)
            at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
            at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
            at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
            at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
            at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
            at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
            at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:114)
            at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233)
            at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228)
            at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
            at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
            at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228)
            at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97)
            at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83)
            at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:610)
            at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:606)
            at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:606)
            at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
            at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:606)
            at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
            at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor$.main(ddqKafkaDataProcessor.scala:402)
            at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor.main(ddqKafkaDataProcessor.scala)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
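
    Reading the trace, the NPE is thrown in SQLConf.getConf while rddToDataFrameHolder builds a DataFrame inside our transform (ddqKafkaDataProcessor.scala:97). It looks as if the restored closures still reference the HiveContext created before the restart, and that context is not recovered from the checkpoint. For reference, the Spark 1.6 streaming programming guide shows obtaining a lazily instantiated singleton SQLContext inside the transformation for use with checkpoint recovery; a minimal sketch of that pattern applied to our stream (the column name is illustrative, and for Hive support one would keep a singleton HiveContext the same lazy way):

        import org.apache.spark.sql.SQLContext

        inputDataDstream.foreachRDD { rdd =>
          // Obtain the singleton SQLContext lazily; after a checkpoint restart
          // this recreates it rather than using the stale pre-restart instance.
          val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
          import sqlContext.implicits._
          // Build DataFrames from the RDD here instead of via the outer HiveContext
          val df = rdd.map(_._2).toDF("rawValue")
          println(s"batch rows: ${df.count()}")
        }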

0 Answers:

No answers yet