I am trying to launch a Flink job on a YARN cluster like this:
flink run -m yarn-cluster -yn 2 -yjm 1024 -ytm 2048 -c myclass flink-test-1.0-SNAPSHOT
Unfortunately, the job fails with a message that the JobManager cannot be contacted. The same job runs successfully on a YARN cluster consisting of a single node (a pseudo-cluster). flink-yarn-yaml points to the correct location of the Hadoop configuration. The YARN application logs for the job also contain no useful hints about what might be wrong. There should be no communication problem between the client and the worker nodes, since other cluster technologies such as Spark, Hive, and MapReduce work fine.
Can you give a hint as to what might be going wrong here? I also tried enabling ZooKeeper leader election in flink-conf.yaml, but that did not help either.
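For reference, the ZooKeeper leader-election attempt used settings roughly like the following in flink-conf.yaml (a sketch using the Flink 1.3.x high-availability keys; the ZooKeeper quorum addresses shown are placeholders, not the real cluster hosts):

```yaml
# Sketch of the HA configuration tried (Flink 1.3.x key names).
# zk1/zk2/zk3 are placeholder hostnames, not the actual quorum.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.zookeeper.storageDir: hdfs://nn:8020/flink/ha/
```

The client log from the failed run follows: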
13-09-2017 16:20:32 org.apache.flink.yarn.cli.FlinkYarnSessionCli.createDescriptor main INFO - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
13-09-2017 16:20:32 org.apache.flink.yarn.cli.FlinkYarnSessionCli.createDescriptor main INFO - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
13-09-2017 16:20:32 org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal main INFO - Using values:
13-09-2017 16:20:32 org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal main INFO - TaskManager count = 2
13-09-2017 16:20:32 org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal main INFO - JobManager memory = 1024
13-09-2017 16:20:32 org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal main INFO - TaskManager memory = 2048
13-09-2017 16:20:32 org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceInit main INFO - Timeline service address: http://server:8188/ws/v1/timeline/
13-09-2017 16:20:33 org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal main WARN - The JobManager or TaskManager memory is below the smallest possible YARN Container size. The value of 'yarn.scheduler.minimum-allocation-mb' is '5120'. Please increase the memory size.YARN will allocate the smaller containers but the scheduler will account for the minimum-allocation-mb, maybe not all instances you requested will start.
13-09-2017 16:20:33 org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster main WARN - The configuration directory ('/flink-1.3.2/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
13-09-2017 16:20:33 org.apache.flink.yarn.Utils.setupLocalResource main INFO - Copying from file:/flink-1.3.2/conf/log4j.properties to hdfs://nn:8020/user/user/.flink/application_1504618071460_3715/log4j.properties
13-09-2017 16:20:33 org.apache.flink.yarn.Utils.setupLocalResource main INFO - Copying from file:/flink-1.3.2/lib to hdfs://nn:8020/user/user/.flink/application_1504618071460_3715/lib
13-09-2017 16:20:34 org.apache.flink.yarn.Utils.setupLocalResource main INFO - Copying from file:/flink-1.3.2/conf/logback.xml to hdfs://nn:8020/user/user/.flink/application_1504618071460_3715/logback.xml
13-09-2017 16:20:34 org.apache.flink.yarn.Utils.setupLocalResource main INFO - Copying from file:code/flink-test/target/flink-test-1.0-SNAPSHOT.jar to hdfs://nn:8020/user/user/.flink/application_1504618071460_3715/flink-test-1.0-SNAPSHOT.jar
13-09-2017 16:20:34 org.apache.flink.yarn.Utils.setupLocalResource main INFO - Copying from file:/flink-1.3.2/lib/flink-dist_2.11-1.3.2.jar to hdfs://nn:8020/user/user/.flink/application_1504618071460_3715/flink-dist_2.11-1.3.2.jar
13-09-2017 16:20:35 org.apache.flink.yarn.Utils.setupLocalResource main INFO - Copying from /flink-1.3.2/conf/flink-conf.yaml to hdfs://nn:8020/user/user/.flink/application_1504618071460_3715/flink-conf.yaml
13-09-2017 16:20:35 org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster main INFO - Adding delegation token to the AM container..
13-09-2017 16:20:35 org.apache.hadoop.hdfs.DFSClient.getDelegationToken main INFO - Created HDFS_DELEGATION_TOKEN token 126644 for user on ha-hdfs:nn
13-09-2017 16:20:35 org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal main INFO - Got dt for hdfs://nn:8020; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nn, Ident: (HDFS_DELEGATION_TOKEN token 126644 for user)
13-09-2017 16:20:35 org.apache.flink.yarn.Utils.obtainTokenForHBase main INFO - Attempting to obtain Kerberos security token for HBase
13-09-2017 16:20:35 org.apache.flink.yarn.Utils.obtainTokenForHBase main INFO - HBase is not available (not packaged with this application): ClassNotFoundException : "org.apache.hadoop.hbase.HBaseConfiguration".
13-09-2017 16:20:35 org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster main INFO - Submitting application master application_1504618071460_3715
13-09-2017 16:20:35 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication main INFO - Submitted application application_1504618071460_3715
13-09-2017 16:20:35 org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster main INFO - Waiting for the cluster to be allocated
13-09-2017 16:20:35 org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster main INFO - Deploying cluster, current state ACCEPTED
13-09-2017 16:20:40 org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster main INFO - YARN application has been deployed successfully.
Cluster started: Yarn cluster with application id application_1504618071460_3715
Using address server2.corp.int:42683 to connect to JobManager.
JobManager web interface address http://server.corp.int:8088/proxy/application_1504618071460_3715/
Using the parallelism provided by the remote cluster (96). To use another parallelism, set it at the ./bin/flink client.
Starting execution of program
13-09-2017 16:20:40 org.apache.flink.client.program.ClusterClient.run main INFO - Starting program in interactive mode
13-09-2017 16:20:40 org.apache.flink.client.program.ClusterClient.logAndSysout main INFO - Waiting until all TaskManagers have connected
Waiting until all TaskManagers have connected
13-09-2017 16:20:40 org.apache.flink.client.program.ClusterClient$LazyActorSystemLoader.get main INFO - Starting client actor system.
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error.
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:545)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:419)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:381)
at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)
at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)
at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:421)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1130)
Caused by: java.lang.RuntimeException: Unable to get ClusterClient status from Application Client
at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)
at org.apache.flink.yarn.YarnClusterClient.waitForClusterToBeReady(YarnClusterClient.java:507)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:454)
at org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:205)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:429)
at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:926)
at org.apache.flink.api.java.DataSet.collect(DataSet.java:410)
at org.apache.flink.api.java.DataSet.print(DataSet.java:1605)
at org.apache.flink.api.scala.DataSet.print(DataSet.scala:1726)
at com.ing.diba.iaa.flink.FlinkWordCountTest$.main(FlinkWordCountTest.scala:21)
at com.ing.diba.iaa.flink.FlinkWordCountTest.main(FlinkWordCountTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
... 13 more
Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.
at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)
at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:238)
... 30 more
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.
at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)
at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)
... 31 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at scala.concurrent.Await.result(package.scala)
at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:77)
... 32 more
13-09-2017 16:20:51 org.apache.flink.yarn.YarnClusterClient.shutdownCluster main INFO - Sending shutdown request to the Application Master
13-09-2017 16:20:51 org.apache.flink.yarn.YarnClusterClient$LazApplicationClientLoader.get main INFO - Start application client.
13-09-2017 16:20:51 org.apache.flink.yarn.YarnClusterClient.getApplicationStatus main WARN - YARN reported application state FAILED
13-09-2017 16:20:51 org.apache.flink.yarn.YarnClusterClient.getApplicationStatus main WARN - Diagnostics: Application application_1504618071460_3715 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1504618071460_3715_000001 exited with exitCode: 243
For more detailed output, check the application tracking page: http://server.corp.int:8088/cluster/app/application_1504618071460_3715 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e18_1504618071460_3715_01_000001
Exit code: 243
Stack trace: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.launchContainer(DefaultLinuxContainerRuntime.java:109)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:89)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:392)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
main : run as user is user
main : requested yarn user is user
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /BIGDATA/hadoop/yarn/local/nmPrivate/application_1504618071460_3715/container_e18_1504618071460_3715_01_000001/container_e18_1504618071460_3715_01_000001.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
Container exited with a non-zero exit code 243
Failing this attempt. Failing the application.
13-09-2017 16:20:51 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Notification about new leader address akka.tcp://flink@server2.corp.int:42683/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
13-09-2017 16:20:51 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Sending StopCluster request to JobManager.
13-09-2017 16:20:51 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Received address of new leader akka.tcp://flink@server2.corp.int:42683/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
13-09-2017 16:20:51 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Disconnect from JobManager null.
13-09-2017 16:20:51 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Trying to register at JobManager akka.tcp://flink@server2.corp.int:42683/user/jobmanager.
13-09-2017 16:20:52 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Trying to register at JobManager akka.tcp://flink@server2.corp.int:42683/user/jobmanager.
13-09-2017 16:20:52 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Sending StopCluster request to JobManager.
13-09-2017 16:20:53 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Trying to register at JobManager akka.tcp://flink@server2.corp.int:42683/user/jobmanager.
13-09-2017 16:20:53 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Sending StopCluster request to JobManager.
13-09-2017 16:20:54 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Sending StopCluster request to JobManager.
13-09-2017 16:20:55 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-4 INFO - Trying to register at JobManager akka.tcp://flink@server2.corp.int:42683/user/jobmanager.
13-09-2017 16:20:55 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-4 INFO - Sending StopCluster request to JobManager.
13-09-2017 16:20:56 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-4 INFO - Sending StopCluster request to JobManager.
13-09-2017 16:20:57 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-4 INFO - Sending StopCluster request to JobManager.
13-09-2017 16:20:58 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-4 INFO - Sending StopCluster request to JobManager.
13-09-2017 16:20:59 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-4 INFO - Trying to register at JobManager akka.tcp://flink@server2.corp.int:42683/user/jobmanager.
13-09-2017 16:20:59 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-4 INFO - Sending StopCluster request to JobManager.
13-09-2017 16:21:00 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-4 INFO - Sending StopCluster request to JobManager.
13-09-2017 16:21:01 org.apache.flink.yarn.YarnClusterClient.shutdownCluster main WARN - Error while stopping YARN cluster.
java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.ready(package.scala:169)
at scala.concurrent.Await.ready(package.scala)
at org.apache.flink.yarn.YarnClusterClient.shutdownCluster(YarnClusterClient.java:367)
at org.apache.flink.yarn.YarnClusterClient.finalizeCluster(YarnClusterClient.java:337)
at org.apache.flink.client.program.ClusterClient.shutdown(ClusterClient.java:245)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:267)
at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)
at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:421)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1130)
13-09-2017 16:21:01 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-4 INFO - Sending StopCluster request to JobManager.
13-09-2017 16:21:02 org.apache.flink.yarn.YarnClusterClient.shutdownCluster main INFO - Application application_1504618071460_3715 finished with state FAILED and final state FAILED at 1505312441186
13-09-2017 16:21:02 org.apache.flink.yarn.YarnClusterClient.shutdownCluster main WARN - Application failed. Diagnostics Application application_1504618071460_3715 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1504618071460_3715_000001 exited with exitCode: 243
For more detailed output, check the application tracking page: http://server.corp.int:8088/cluster/app/application_1504618071460_3715 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e18_1504618071460_3715_01_000001
Exit code: 243
Stack trace: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.launchContainer(DefaultLinuxContainerRuntime.java:109)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:89)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:392)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
main : run as user is user
main : requested yarn user is user
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /BIGDATA/hadoop/yarn/local/nmPrivate/application_1504618071460_3715/container_e18_1504618071460_3715_01_000001/container_e18_1504618071460_3715_01_000001.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
Container exited with a non-zero exit code 243
Failing this attempt. Failing the application.
13-09-2017 16:21:02 org.apache.flink.yarn.YarnClusterClient.shutdownCluster main WARN - If log aggregation is activated in the Hadoop cluster, we recommend to retrieve the full application log using this command:
yarn logs -applicationId application_1504618071460_3715
(It sometimes takes a few seconds until the logs are aggregated)
13-09-2017 16:21:02 org.apache.flink.yarn.YarnClusterClient.shutdownCluster main INFO - YARN Client is shutting down
13-09-2017 16:21:02 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Stopped Application client.
13-09-2017 16:21:02 grizzled.slf4j.Logger.info flink-akka.actor.default-dispatcher-2 INFO - Disconnect from JobManager null.