How do I restart a Spark Streaming job from a checkpoint on Dataproc?

Asked: 2017-05-16 17:55:11

Tags: google-cloud-dataproc

This is a follow-up to Spark streaming on dataproc throws FileNotFoundException.

Over the past few weeks (I don't know exactly when), restarting a Spark Streaming job, even with the "kill dataproc.agent" trick, has started throwing this exception:

17/05/16 17:39:02 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at stream-event-processor-m/10.138.0.3:8032
17/05/16 17:39:03 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1494955637459_0006
17/05/16 17:39:04 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
    at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:140)
    at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:826)
    at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:826)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:826)
    at com.thumbtack.common.model.SparkStream$class.main(SparkStream.scala:73)
    at com.thumbtack.skyfall.StreamEventProcessor$.main(StreamEventProcessor.scala:19)
    at com.thumbtack.skyfall.StreamEventProcessor.main(StreamEventProcessor.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/05/16 17:39:04 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@5555ffcf{HTTP/1.1}{0.0.0.0:4479}
17/05/16 17:39:04 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
17/05/16 17:39:04 ERROR org.apache.spark.util.Utils: Uncaught exception in thread main
java.lang.NullPointerException
    at org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:152)
    at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1360)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:87)
    at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1797)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1290)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1796)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:565)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
    at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:140)
    at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:826)
    at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:826)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:826)
    at com.thumbtack.common.model.SparkStream$class.main(SparkStream.scala:73)
    at com.thumbtack.skyfall.StreamEventProcessor$.main(StreamEventProcessor.scala:19)
    at com.thumbtack.skyfall.StreamEventProcessor.main(StreamEventProcessor.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
    at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:140)
    at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:826)
    at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:826)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:826)
    at com.thumbtack.common.model.SparkStream$class.main(SparkStream.scala:73)
    at com.thumbtack.skyfall.StreamEventProcessor$.main(StreamEventProcessor.scala:19)
    at com.thumbtack.skyfall.StreamEventProcessor.main(StreamEventProcessor.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Job output is complete

How do I restart a Spark Streaming job from a checkpoint on a Dataproc cluster?

1 Answer

Answer 0 (score: 2)

We recently added auto-restart capabilities for Dataproc jobs (available via the gcloud beta track and in the v1 API).

To take advantage of auto-restart, a job must be able to recover and clean up after itself, so it will not work for most jobs without modification. However, Spark Streaming with checkpoint files works out of the box.
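For reference, the recovery pattern that makes a Spark Streaming job restartable from a checkpoint looks roughly like this (a minimal sketch using the standard StreamingContext.getOrCreate API; the checkpoint path, app name, and batch interval below are placeholders, not taken from the original post):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedStream {
  // Hypothetical checkpoint location; on Dataproc this would typically be a GCS path.
  val checkpointDir = "gs://my-bucket/checkpoints/stream-event-processor"

  // Called only when no checkpoint exists: builds the full streaming DAG
  // and registers the checkpoint directory.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("StreamEventProcessor")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // ... define input streams and transformations here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate rebuilds the context (and the DAG) from the
    // checkpoint instead of calling createContext().
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

A job structured this way can be killed and resubmitted (or auto-restarted) and will resume from the last completed checkpoint.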

The restart-dataproc-agent trick is no longer necessary. Auto-restart is resilient against job crashes, Dataproc agent failures, and VM restart/migration events.

Example: gcloud beta dataproc jobs submit spark ... --max-failures-per-hour 1

See: https://cloud.google.com/dataproc/docs/concepts/restartable-jobs

If you want to test recovery, you can simulate a VM migration by resetting the master VM [1]. Afterwards, you should be able to describe the job [2] and see an ATTEMPT_FAILURE entry in statusHistory.

[1] gcloud compute instances reset <cluster-name>-m

[2] gcloud dataproc jobs describe