I have a simple Spark Streaming application that reads data from RabbitMQ and performs some aggregations on a 30-second batch interval, with window intervals of 1 minute and 1 hour.
I have a three-node setup with checkpointing enabled, and I use sshfs to mount the same directory on all worker nodes for checkpointing.
When I run the Spark Streaming app for the first time, it works fine. I can see the results printed to the console, and checkpoints are written to the network directory.
But after I kill the driver process and restart it, it fails with the following exception:
ERROR 2015-11-06 08:29:10 org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1446778740000 ms.2
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 506.0 failed 4 times, most recent failure: Lost task 0.3 in stage 506.0 (TID 858, 10.29.23.166): java.lang.Exception: Could not compute split, block input-0-1446778594400 not found
at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[scala-library-2.10.5.jar:na]
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ~[scala-library-2.10.5.jar:na]
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at scala.Option.foreach(Option.scala:236) ~[scala-library-2.10.5.jar:na]
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
WARN 2015-11-06 08:29:10 org.apache.spark.ui.jobs.JobProgressListener: Task start for unknown stage 507
WARN 2015-11-06 08:29:10 org.apache.spark.ui.jobs.JobProgressListener: Task start for unknown stage 508
WARN 2015-11-06 08:29:10 org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 509.0 (TID 882): java.lang.Exception: Could not compute split, block input-0-1446778622600 not found
at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This exception repeats.
I am not pushing a large volume of data into RabbitMQ. The first time I ran the job, I dumped fewer than 100 events. The second time, I had already stopped the producer process from sending messages to RabbitMQ.
I have tried setting "spark.streaming.unpersist" to "true".
My setup has 3 nodes, each with one core allocated to Spark and 512 MB of executor memory per node.
Spark version - 1.4.1 (DSE 4.8)
Stratio RabbitMQ receiver - release 1.0
Code:
def createContext(checkpointDirectory: String, config: Config): StreamingContext = {
  println("Creating new context")
  val conf = new SparkConf(true).setAppName(appName).set("spark.streaming.unpersist", "true")
  val ssc = new StreamingContext(conf, Seconds(config.getInt(batchIntervalParam)))
  ssc.checkpoint(checkpointDirectory)
  val isValid = validate(ssc, config)
  if (isValid) {
    val result = runJob(ssc, config)
    println("result is " + result)
  } else {
    println(isValid.toString)
  }
  ssc
}

def main(args: Array[String]): Unit = {
  if (args.length < 1) {
    println("Must specify the path to config file ")
    println("Usage progname <path to config file> ")
    return
  }
  val url = args(0)
  logger.info("Starting " + appName)
  println("Got the path as %s".format(url))
  val source = scala.io.Source.fromFile(url)
  val lines = try source.mkString finally source.close()
  val config = ConfigFactory.parseString(lines)
  val directoryPath = config.getString(checkPointParam)
  val ssc = StreamingContext.getOrCreate(directoryPath, () => {
    createContext(directoryPath, config)
  })
  ssc.start()
  ssc.awaitTermination()
}

def getRabbitMQStream(config: Config, ssc: StreamingContext): ReceiverInputDStream[String] = {
  val rabbitMQHost = config.getString(rabbitmqHostParam)
  val rabbitMQPort = config.getInt(rabbitmqPortParam)
  val rabbitMQQueue = config.getString(rabbitmqQueueNameParam)
  println("changing the memory lvel")
  val receiverStream: ReceiverInputDStream[String] = {
    RabbitMQUtils.createStreamFromAQueue(ssc, rabbitMQHost, rabbitMQPort, rabbitMQQueue, StorageLevel.MEMORY_AND_DISK_SER)
  }
  receiverStream.start()
  receiverStream
}

def getBaseDstream(config: Config, ssc: StreamingContext): ReceiverInputDStream[String] = {
  val baseDstream = config.getString(receiverTypeParam) match {
    case "rabbitmq" => getRabbitMQStream(config, ssc)
  }
  baseDstream
}

def runJob(ssc: StreamingContext, config: Config): Any = {
  val keyspace = config.getString(keyspaceParam)
  val clientStatsTable = config.getString(clientStatsTableParam)
  val hourlyStatsTable = config.getString(hourlyStatsTableParam)
  val batchInterval = config.getInt(batchIntervalParam)
  val windowInterval = config.getInt(windowIntervalParam)
  val hourlyInterval = config.getInt(hourlyParam)
  val limit = config.getInt(limitParam)
  val lines = getBaseDstream(config, ssc)
  val statsRDD = lines.filter(_.contains("client_stats")).map(_.split(",")(1))
  val parserFunc = getProtobufParserFunction()
  val clientUsageRDD: DStream[((String, String), Double)] = statsRDD.flatMap(x => parserFunc(x))
  val formatterFunc = getJsonFormatterFunc()
  val oneMinuteWindowResult = clientUsageRDD.reduceByKeyAndWindow((x: Double, y: Double) => x + y, Seconds(windowInterval), Seconds(batchInterval))
    .map(x => ((x._1._2), ArrayBuffer((x._1._1, x._2))))
    .reduceByKey((x, y) => (x ++ y))
    .mapValues(x => (x.toList.sortBy(x => -x._2).take(limit)))
  println("Client Usage from rabbitmq ")
  oneMinuteWindowResult.map(x => (x._1, DateTime.now, formatterFunc(x._2))).saveToCassandra(keyspace, clientStatsTable)
  oneMinuteWindowResult.print()
  val HourlyResult = clientUsageRDD.reduceByKeyAndWindow((x: Double, y: Double) => x + y, Seconds(hourlyInterval), Seconds(batchInterval))
    .map(x => ((x._1._2), ArrayBuffer((x._1._1, x._2))))
    .reduceByKey((x, y) => (x ++ y))
    .mapValues(x => (x.toList.sortBy(x => -x._2).take(limit)))
  HourlyResult.map(x => (x._1, DateTime.now, formatterFunc(x._2))).saveToCassandra(keyspace, hourlyStatsTable)
  HourlyResult.map(x => (x, "hourly")).print()
}
}
Please help me resolve this issue.
Answer 0 (score: 2)
You are creating the StreamingContext incorrectly for use with checkpointing.
As you can see here: http://spark.apache.org/docs/1.4.1/streaming-programming-guide.html#how-to-configure-checkpointing, the correct way to instantiate a StreamingContext that uses checkpointing is:
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)   // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...

// Start the context
context.start()
context.awaitTermination()
A new StreamingContext instance will only be created if there is no data in the given checkpoint directory.
Also, regarding the checkpointing folder: as far as I know, you need HDFS installed in your cluster rather than sharing data between nodes with sshfs:
Configuring checkpointing - If the stream application requires it, then a directory in a Hadoop API compatible fault-tolerant storage (e.g. HDFS, S3, etc.) must be configured as the checkpoint directory and the streaming application written in a way that checkpoint information can be used for failure recovery.
More information here: http://spark.apache.org/docs/1.4.1/streaming-programming-guide.html#requirements
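For illustration only, pointing the checkpoint directory at a Hadoop-compatible store instead of the sshfs mount could look like the sketch below; the namenode host, port, and path are placeholders, not values from your environment:

// Sketch (hypothetical HDFS location): checkpoint to fault-tolerant storage
// rather than to a locally mounted sshfs directory.
ssc.checkpoint("hdfs://namenode-host:8020/user/spark/checkpoints/myStreamingApp")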
Hope it helps.
Answer 1 (score: 0)
Two important points to note when recovering from a checkpoint -
1. spark.streaming.receiver.writeAheadLog.enable should be true to enable the write-ahead log (see the short sketch at the end of this answer).
2. Creating the DStream, iterating over the RDDs of each batch, writing to HDFS, and everything else should be done before returning the StreamingContext in the callback passed to getOrCreate().
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)   // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  // do all stuffs here
  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
  // or here
  ssc
}
// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
// nothing can be done here once the context has been created by restoring from checkpoint
// Start the context
context.start()
context.awaitTermination()
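On the first point, a minimal sketch of enabling the write-ahead log when building the SparkConf; appName is reused from the question's code, and the rest of your configuration would stay as it is:

// Sketch: turn on the receiver write-ahead log so received blocks can be
// replayed after a driver restart; this only helps together with a
// fault-tolerant checkpoint directory such as HDFS.
val conf = new SparkConf(true)
  .setAppName(appName)
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")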