纱线崩溃时恢复Flink

时间:2018-11-08 19:13:33

标签: apache-flink

我正在EMR(1个Master 2 Core节点)上运行yarn 3节点集群。我正在使用1.6.0。我启用了检查点功能(rocksdb),正在写入S3。检查点在其他测试中似乎可以正常工作。在主节点上发生纱线崩溃(在这种情况下,我杀死了纱线过程)的情况下,我无法从最后一个检查点恢复我的应用程序。这是我尝试重新启动时的输出:

[hadoop@emr flink-1.6.0]$ bin/flink run -s s3://bucket/kinesis-pipeline-checkpoint/a8a9ceb95845c3ea9833e025b5771470 -p 1 -d ~/pipeline-assembly-0.2.0.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/flink-1.6.0/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-11-08 19:01:06,069 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Found Yarn properties file under /tmp/.yarn-properties-hadoop.
2018-11-08 19:01:06,069 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Found Yarn properties file under /tmp/.yarn-properties-hadoop.
2018-11-08 19:01:06,488 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - YARN properties set default parallelism to 1
2018-11-08 19:01:06,488 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - YARN properties set default parallelism to 1
YARN properties set default parallelism to 1
2018-11-08 19:01:06,637 INFO  org.apache.hadoop.yarn.client.RMProxy                         - Connecting to ResourceManager at emr:8032
2018-11-08 19:01:06,745 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2018-11-08 19:01:06,745 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2018-11-08 19:01:06,845 INFO  org.apache.flink.yarn.AbstractYarnClusterDescriptor           - Found application JobManager host name 'emr' and port '39541' from supplied application id 'application_1541703591281_0001'
Starting execution of program

------------------------------------------------------------
The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Could not submit job (JobID: c701b6511ad76b5e4faae703763f388e)
    at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:249)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
    at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:432)
    at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
    at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
    at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
    at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:379)
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
    at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
    at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
    at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
    at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
    ... 12 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
    ... 10 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
    at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
    at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
    at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
    at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:953)
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
    ... 4 more
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
    at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:310)
    at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:294)
    at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
    ... 5 more

这是预期的行为,还是在这种情况下我做错了什么?

谢谢

更新:jobmanager.log

    LogType:jobmanager.log
    Log Upload Time:Tue Nov 20 16:37:52 +0000 2018
    LogLength:49255
    Log Contents:
    2018-11-20 16:33:33,276 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
    2018-11-20 16:33:33,277 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting YarnSessionClusterEntrypoint (Version: 1.6.0, Rev:ff472b4, Date:07.08.2018 @ 13:31:13 UTC)
    2018-11-20 16:33:33,278 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: yarn
    2018-11-20 16:33:33,672 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: hadoop
    2018-11-20 16:33:33,672 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
    2018-11-20 16:33:33,672 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 13653 MiBytes
    2018-11-20 16:33:33,672 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /usr/lib/jvm/java-openjdk
    2018-11-20 16:33:33,673 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.8.3
    2018-11-20 16:33:33,673 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
    2018-11-20 16:33:33,673 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Xmx15360m
    2018-11-20 16:33:33,673 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog.file=/var/log/hadoop-yarn/containers/application_1542731534971_0001/container_1542731534971_0001_01_000001/jobmanager.log
    2018-11-20 16:33:33,673 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:logback.xml
    2018-11-20 16:33:33,673 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:log4j.properties
    2018-11-20 16:33:33,673 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments: (none)
    2018-11-20 16:33:33,674 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
    2018-11-20 16:33:33,675 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
    2018-11-20 16:33:33,678 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - YARN daemon is running as: hadoop Yarn client user obtainer: hadoop
    2018-11-20 16:33:33,680 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.fs.checkpointdir, s3://bucket/kinesis-checkpoint
    2018-11-20 16:33:33,680 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.timeout, 60000
    2018-11-20 16:33:33,680 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.cluster-id, application_1542731534971_0001
    2018-11-20 16:33:33,680 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, localhost
    2018-11-20 16:33:33,681 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
    2018-11-20 16:33:33,681 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
    2018-11-20 16:33:33,681 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: internal.cluster.execution-mode, NORMAL
    2018-11-20 16:33:33,681 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.fraction, 0.9
    2018-11-20 16:33:33,681 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 1
    2018-11-20 16:33:33,681 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 1
    2018-11-20 16:33:33,681 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, rocksdb
    2018-11-20 16:33:33,681 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.ask.timeout, 60s
    2018-11-20 16:33:33,681 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 20480m
    2018-11-20 16:33:33,682 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 20480m
    2018-11-20 16:33:33,682 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, s3://bucket/kinesis-checkpoint
    2018-11-20 16:33:33,695 INFO  org.apache.flink.runtime.clusterframework.BootstrapTools      - Setting directories for temporary files to: /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001
    2018-11-20 16:33:33,708 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting YarnSessionClusterEntrypoint.
    2018-11-20 16:33:33,708 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
    2018-11-20 16:33:33,772 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to hadoop (auth:SIMPLE)
    2018-11-20 16:33:33,786 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
    2018-11-20 16:33:33,791 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at ip-172-31-18-80.us-west-2.compute.internal:45751
    2018-11-20 16:33:34,239 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
    2018-11-20 16:33:34,328 INFO  akka.remote.Remoting                                          - Starting remoting
    2018-11-20 16:33:34,428 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@ip-172-31-18-80.us-west-2.compute.internal:45751]
    2018-11-20 16:33:34,437 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@ip-172-31-18-80.us-west-2.compute.internal:45751
    2018-11-20 16:33:34,469 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001/blobStore-1dc43ec8-8ed7-4342-adae-c8d20a691640
    2018-11-20 16:33:34,473 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:39955 - max concurrent requests: 50 - max backlog: 1000
    2018-11-20 16:33:34,488 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
    2018-11-20 16:33:34,492 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001/executionGraphStore-0c4fd7ac-17d2-40d6-b279-dfef5041a76f, expiration time 3600000, maximum cache size 52428800 bytes.
    2018-11-20 16:33:34,514 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001/blobStore-4c662c5c-afa5-4bf2-8a01-3acc0b9aa491
    2018-11-20 16:33:34,521 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /tmp/flink-web-6885656b-18cc-451f-8853-03ff7cf14b0e/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
    2018-11-20 16:33:34,522 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /tmp/flink-web-6885656b-18cc-451f-8853-03ff7cf14b0e/flink-web-upload for file uploads.
    2018-11-20 16:33:34,525 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
    2018-11-20 16:33:34,702 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of main cluster component log file: /var/log/hadoop-yarn/containers/application_1542731534971_0001/container_1542731534971_0001_01_000001/jobmanager.log
    2018-11-20 16:33:34,702 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of main cluster component stdout file: /var/log/hadoop-yarn/containers/application_1542731534971_0001/container_1542731534971_0001_01_000001/jobmanager.out
    2018-11-20 16:33:34,844 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at ip-172-31-18-80.us-west-2.compute.internal:35939
    2018-11-20 16:33:34,844 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://ip-172-31-18-80.us-west-2.compute.internal:35939 was granted leadership with leaderSessionID=00000000-0000-0000-0000-000000000000
    2018-11-20 16:33:34,844 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://ip-172-31-18-80.us-west-2.compute.internal:35939.
    2018-11-20 16:33:34,857 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.yarn.YarnResourceManager at akka://flink/user/resourcemanager .
    2018-11-20 16:33:34,948 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
    2018-11-20 16:33:34,981 INFO  org.apache.hadoop.yarn.client.RMProxy                         - Connecting to ResourceManager at ip-172-31-30-52.us-west-2.compute.internal/172.31.30.52:8030
    2018-11-20 16:33:35,234 INFO  org.apache.flink.yarn.YarnResourceManager                     - Recovered 0 containers from previous attempts ([]).
    2018-11-20 16:33:35,237 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - yarn.client.max-cached-nodemanagers-proxies : 0
    2018-11-20 16:33:35,238 INFO  org.apache.flink.yarn.YarnResourceManager                     - ResourceManager akka.tcp://flink@ip-172-31-18-80.us-west-2.compute.internal:45751/user/resourcemanager was granted leadership with fencing token 00000000000000000000000000000000
    2018-11-20 16:33:35,239 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
    2018-11-20 16:33:35,252 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@ip-172-31-18-80.us-west-2.compute.internal:45751/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
    2018-11-20 16:33:35,252 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
    2018-11-20 16:34:20,094 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Submitting job bd0d5dbaeba3990a3bef1eebee49cd79 (Data Session Pipeline v0.0.7).
    2018-11-20 16:34:20,108 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at akka://flink/user/jobmanager_0 .
    2018-11-20 16:34:20,115 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Initializing job Data Session Pipeline v0.0.7 (bd0d5dbaeba3990a3bef1eebee49cd79).
    2018-11-20 16:34:20,124 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=2147483647, delayBetweenRestartAttempts=0) for Data Session Pipeline v0.0.7 (bd0d5dbaeba3990a3bef1eebee49cd79).
    2018-11-20 16:34:20,127 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.slotpool.SlotPool at akka://flink/user/0e6f5de3-53ad-4bae-acf3-3c66106c0a54 .
    2018-11-20 16:34:20,148 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job recovers via failover strategy: full graph restart
    2018-11-20 16:34:20,170 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Running initialization on master for job Data Session Pipeline v0.0.7 (bd0d5dbaeba3990a3bef1eebee49cd79).
    2018-11-20 16:34:20,170 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Successfully ran initialization on master in 0 ms.
    2018-11-20 16:34:20,203 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Using application-defined state backend: RocksDBStateBackend{checkpointStreamBackend=File State Backend (checkpoints: 's3://bucket/kinesis-checkpoint', savepoints: 'null', asynchronous: UNDEFINED, fileStateThreshold: -1), localRocksDbDirectories=null, enableIncrementalCheckpointing=TRUE}
    2018-11-20 16:34:20,203 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Configuring application-defined state backend with job/cluster config
    2018-11-20 16:34:22,624 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Starting job bd0d5dbaeba3990a3bef1eebee49cd79 from savepoint s3://bucket/kinesis-pipeline-checkpoint/8a6e5aeebeef202a2daddd3cf9419a80 ()
    2018-11-20 16:34:22,663 ERROR org.apache.flink.runtime.rest.handler.job.JobSubmitHandler    - Exception occurred in REST handler.
    org.apache.flink.runtime.rest.handler.RestHandlerException: Job submission failed.
            at org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$handleRequest$2(JobSubmitHandler.java:119)
            at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
            at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
            at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
            at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
            at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
            at akka.dispatch.OnComplete.internal(Future.scala:258)
            at akka.dispatch.OnComplete.internal(Future.scala:256)
            at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
            at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
            at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
            at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
            at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
            at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
            at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:20)
            at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:18)
            at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
            at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
            at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
            at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
            at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
            at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
            at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
            at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
            at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
            at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
            at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
            at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
            at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
            at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
            at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$submitJob$2(Dispatcher.java:256)
            at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
            at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
            at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
            at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
            at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:690)
            at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
            at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
            at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
            at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
            at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
            at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
            at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
            at akka.actor.ActorCell.invoke(ActorCell.scala:495)
            at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
            at akka.dispatch.Mailbox.run(Mailbox.scala:224)
            at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
            ... 4 more
    Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
            ... 24 more
    Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
            at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
            at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
            at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:708)
            at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:687)
            ... 18 more
    Caused by: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
            at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
            at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:40)
            at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$waitForTerminatingJobManager$29(Dispatcher.java:820)
            at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705)
            ... 19 more
    Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
            at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
            at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
            at org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291)
            at org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281)
            at org.apache.flink.runtime.dispatcher.Dispatcher.persistAndRunJob(Dispatcher.java:266)
            at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38)
            ... 21 more
    Caused by: java.io.FileNotFoundException: Cannot find meta data file '_metadata' in directory 's3://sledfs/kinesis-pipeline-checkpoint/8a6e5aeebeef202a2daddd3cf9419a80'. Please try to load the checkpoint/savepoint directly from the metadata file instead of the directory.
            at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:256)
            at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:109)
            at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1102)
            at org.apache.flink.runtime.jobmaster.JobMaster.tryRestoreExecutionGraphFromSavepoint(JobMaster.java:1220)
            at org.apache.flink.runtime.jobmaster.JobMaster.createAndRestoreExecutionGraph(JobMaster.java:1144)
            at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:295)
            at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:157)
            ... 26 more
    2018-11-20 16:37:52,321 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
    2018-11-20 16:37:52,322 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Shutting down BLOB cache
    2018-11-20 16:37:52,340 INFO  org.apache.flink.runtime.blob.BlobServer                      - Stopped BLOB server at 0.0.0.0:39955

1 个答案:

答案 0 :(得分:1)

您引用的s3://bucket/kinesis-pipeline-checkpoint/a8a9ceb95845c3ea9833e025b5771470检查点不包含有效的_metadata文件。这表明此检查点已启动,但无法完成。请选择已成功完成的检查点。