We are building a fault-tolerant system that reads from Kafka and writes to HBase and HDFS. Batches run every 5 seconds. Here is the scenario we want to set up:

- Start a new Spark Streaming job with checkpointing enabled; it reads from Kafka, processes the data, and stores it to HDFS and HBase.
- Kill the Spark Streaming job while messages continue to flow into Kafka.
- Restart the Spark Streaming job, and this is what we really want to happen: Spark Streaming reads the checkpoint data and resumes from the correct Kafka offsets. No Kafka messages are skipped, even though the job was killed and restarted.

This does not seem to work: the Spark Streaming job will not start (stack trace of the error pasted below). The only way I can resubmit the job is to delete the checkpoint directory, which of course means all checkpoint information is lost and the Spark job starts reading only new Kafka messages.

Is this supposed to work? If so, do I need to do anything specific to make it work?
Here is the sample code:

1) I am on Spark 1.6.2. Here is how I create the streaming context:
val ddqSsc = StreamingContext.getOrCreate(checkpointDir, () =>
  createDDQStreamingContext(slideInterval.toLong, inputKafka, outputKafka,
    hbaseVerTableName, checkpointDir, baseRawHdfs, securityProtocol, groupID,
    zooKeeper, kafkaBrokers, hiveDBToLoad, hiveTableToLoad))
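For context on the recovery semantics: StreamingContext.getOrCreate only invokes the factory function when checkpointDir contains no valid checkpoint; on a restart it rebuilds the context, the DStream graph, and (for a direct Kafka stream) the stored offsets from the checkpoint, skipping the factory entirely. That is why the whole graph must be defined inside the factory. A minimal sketch of that pattern, with placeholder names (CheckpointSketch, the socket source) that are not from our job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  // Factory: only invoked on a cold start; on recovery the whole DStream
  // graph (and Kafka offsets, for a direct stream) comes from the checkpoint.
  def createContext(checkpointDir: String): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("checkpoint-sketch"), Seconds(5))
    ssc.checkpoint(checkpointDir)
    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
    lines.count().print() // the full graph must be set up here, inside the factory
    ssc
  }

  def main(args: Array[String]): Unit = {
    val checkpointDir = args(0) // hypothetical: checkpoint dir passed as an argument
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
    ssc.start()
    ssc.awaitTermination()
  }
}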
2) Here is the initial portion of the function that getOrCreate calls:
// Imports assumed by this snippet (Spark 1.6 with spark-streaming-kafka):
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

def createDDQStreamingContext(
    slideInterval: Long, inputKafka: String, outputKafka: String,
    hbaseVerTableName: String, checkpointDir: String, baseRawHdfs: String,
    securityProtocol: String, groupID: String, zooKeeper: String,
    kafkaBrokers: String, hiveDBToLoad: String, hiveTableToLoad: String): StreamingContext = {

  val sparkConf = new SparkConf()
  val ssc = new StreamingContext(sparkConf, Seconds(slideInterval))

  //val sqlContext = new SQLContext(sc)
  val sqlContext = new HiveContext(ssc.sparkContext)
  import sqlContext.implicits._

  ssc.checkpoint(checkpointDir)

  val kafkaTopics = Set(inputKafka)

  // Kafka parameters
  var kafkaParams = Map[String, String]()
  kafkaParams += ("bootstrap.servers" -> kafkaBrokers)
  kafkaParams += ("zookeeper.connect" -> zooKeeper)
  // Needed in a Kerberized environment
  kafkaParams += ("security.protocol" -> securityProtocol)
  kafkaParams += ("sasl.kerberos.service.name" -> "kafka")
  // WHAT IS THIS!!??
  kafkaParams += ("group.id" -> groupID)
  kafkaParams += ("key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer")
  kafkaParams += ("value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer")

  val inputDataDstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, kafkaTopics)
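Incidentally, the Spark 1.6 streaming programming guide's DataFrame example does not capture a SQLContext/HiveContext built in the factory; it obtains a lazily instantiated singleton via SQLContext.getOrCreate inside the DStream operation, precisely so the context can be re-created after a restart from a checkpoint. A minimal sketch of that pattern (the tuple mapping, column names, and count() are illustrative placeholders, not our actual processing):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Inside createDDQStreamingContext, after building inputDataDstream:
inputDataDstream.map(_._2).foreachRDD { (rdd: RDD[String]) =>
  // Get (or lazily re-create) the singleton SQLContext from the RDD's
  // SparkContext. After a checkpoint restart this builds a fresh context
  // instead of relying on a reference captured when the job first ran.
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  val df = rdd.map(line => (line, line.length)).toDF("value", "length") // illustrative columns
  df.count() // placeholder action standing in for the real HDFS/HBase writes
}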
================== Stack trace ====================
2017-04-03 11:27:27,047 ERROR [Driver] yarn.ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
    at org.apache.spark.sql.SQLConf.dataFrameEagerAnalysis(SQLConf.scala:573)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
    at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:417)
    at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)
    at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor$$anonfun$createDDQStreamingContext$1.apply(ddqKafkaDataProcessor.scala:97)
    at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor$$anonfun$createDDQStreamingContext$1.apply(ddqKafkaDataProcessor.scala:73)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:700)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:700)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:714)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:712)
    at org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:46)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.TransformedDStream.createRDDWithLocalProperties(TransformedDStream.scala:65)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
    at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
    at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:47)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:114)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:114)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228)
    at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97)
    at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83)
    at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:610)
    at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:606)
    at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:606)
    at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
    at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:606)
    at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
    at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor$.main(ddqKafkaDataProcessor.scala:402)
    at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor.main(ddqKafkaDataProcessor.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)