Flink running on Docker Swarm cannot resolve the job manager when the machines are in different virtual networks/subnets

Asked: 2017-10-31 23:18:08

Tags: azure docker google-cloud-platform apache-flink docker-swarm

I am running the standard Flink Docker project: https://github.com/apache/flink/tree/master/flink-contrib/docker-flink

The machines that are part of the swarm are in different clouds: Azure and Google Cloud.

Here are the steps to reproduce.

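  1. Create the swarm: docker swarm init --advertise-addr XXXXXX

  2. Create the registry: docker service create --name registry --publish 5000:5000 registry:2

  3. Add all machines to the swarm using the worker token printed in step 1.

    docker node ls shows all machines as "Ready".

  4. Push the images to the registry: docker-compose push

  5. Deploy the Flink services to the swarm: docker stack deploy --compose-file docker-compose.yml flink

  6. Scale the Flink task managers: docker service scale flink_taskmanager=20

  7. Keep checking: docker service ps flink_taskmanager | grep Running

  8. Docker Swarm tries to start flink_taskmanager on all machines, but the tasks that are not in the same virtual network/subnet as the container running flink_jobmanager fail with the following error: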

    2017-10-31 22:37:32,255 WARN  org.apache.hadoop.security.UserGroupInformation               - PriviledgedActionException as:flink (auth:SIMPLE) cause:java.net.UnknownHostException: jobmanager: Name or service not known
    2017-10-31 22:37:32,256 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
    java.net.UnknownHostException: jobmanager: Name or service not known
            at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
            at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
            at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
            at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
            at java.net.InetAddress.getAllByName(InetAddress.java:1192)
            at java.net.InetAddress.getAllByName(InetAddress.java:1126)
            at java.net.InetAddress.getByName(InetAddress.java:1076)
            at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.getRpcUrl(AkkaRpcServiceUtils.java:173)
            at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.getRpcUrl(AkkaRpcServiceUtils.java:138)
            at org.apache.flink.runtime.highavailability.HighAvailabilityServicesUtils.createHighAvailabilityServices(HighAvailabilityServicesUtils.java:78)
            at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1663)
            at org.apache.flink.runtime.taskmanager.TaskManager$$anon$2.call(TaskManager.scala:1574)
            at org.apache.flink.runtime.taskmanager.TaskManager$$anon$2.call(TaskManager.scala:1572)
            at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:422)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
            at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
            at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1572)
            at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
    

    This is a few seconds after I scaled the task managers (before the failures):

    thalita@ubuntu-swarm-manager:~/flink/flink-contrib/docker-flink$ docker service ps flink_taskmanager | grep Running
    ktcy4ujro1yo  flink_taskmanager.1       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 4 seconds ago
    qbpjoua6ctbg  flink_taskmanager.2       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 12 seconds ago
    ymlripufi9qe  flink_taskmanager.3       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 16 seconds ago
    xvfcqj2cnnph  flink_taskmanager.4       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 15 seconds ago
    lwvkkz3mx7ij  flink_taskmanager.6       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 7 seconds ago
    wrb78346dvmg  flink_taskmanager.7       flink:1.3.2-hadoop2-scala_2.10  google-cloud-worker-1  Running        Running 5 seconds ago
    m31bf1cenevj  flink_taskmanager.8       flink:1.3.2-hadoop2-scala_2.10  google-cloud-worker-1  Running        Running 5 seconds ago
    oe2ff8ijuer4  flink_taskmanager.9       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 11 seconds ago
    vuw3dxugyjyi  flink_taskmanager.10      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 15 seconds ago
    xhmdbi9jad86  flink_taskmanager.11      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 10 seconds ago
    o3tw38bok4b9  flink_taskmanager.12      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 10 minutes ago
    knc54g7ayp1g  flink_taskmanager.13      flink:1.3.2-hadoop2-scala_2.10  google-cloud-worker-1  Running        Running 7 seconds ago
    bqio2ubvik5j  flink_taskmanager.14      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 6 seconds ago
    qauubxm3msda  flink_taskmanager.15      flink:1.3.2-hadoop2-scala_2.10  google-cloud-worker-1  Running        Running 5 seconds ago
    v9hjfadfn9y6  flink_taskmanager.16      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 4 seconds ago
    d8oh7ol4g90y  flink_taskmanager.17      flink:1.3.2-hadoop2-scala_2.10  google-cloud-worker-1  Running        Running 3 seconds ago
    9d4m7bb1bprp  flink_taskmanager.18      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 10 seconds ago
    ri00r8ehvwsh  flink_taskmanager.19      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 5 seconds ago
    

    A few seconds later, only the tasks on the Azure machines are still running:

    docker service ps flink_taskmanager | grep Running
    ktcy4ujro1yo  flink_taskmanager.1       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 2 minutes ago
    qbpjoua6ctbg  flink_taskmanager.2       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 2 minutes ago
    ymlripufi9qe  flink_taskmanager.3       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 2 minutes ago
    xvfcqj2cnnph  flink_taskmanager.4       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 2 minutes ago
    5efusat5ay60  flink_taskmanager.5       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 2 minutes ago
    lwvkkz3mx7ij  flink_taskmanager.6       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 2 minutes ago
    v2vndema8k74  flink_taskmanager.7       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 2 minutes ago
    l92tjj0498v2  flink_taskmanager.8       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 2 minutes ago
    oe2ff8ijuer4  flink_taskmanager.9       flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 2 minutes ago
    vuw3dxugyjyi  flink_taskmanager.10      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 2 minutes ago
    xhmdbi9jad86  flink_taskmanager.11      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 2 minutes ago
    o3tw38bok4b9  flink_taskmanager.12      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 12 minutes ago
    6rlm2pu2gn21  flink_taskmanager.13      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 2 minutes ago
    bqio2ubvik5j  flink_taskmanager.14      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 2 minutes ago
    63r9kmrh46gw  flink_taskmanager.15      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 2 minutes ago
    v9hjfadfn9y6  flink_taskmanager.16      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 2 minutes ago
    vmrf20o9eo5m  flink_taskmanager.17      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1   Running        Running 2 minutes ago
    9d4m7bb1bprp  flink_taskmanager.18      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 2 minutes ago
    ri00r8ehvwsh  flink_taskmanager.19      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 2 minutes ago
    8h21y4r49scb  flink_taskmanager.20      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running        Running 2 minutes ago
    

    This is where the job manager is running (also on an Azure machine):

    docker service ps flink_jobmanager | grep Running
    p6bzg567ewhn  flink_jobmanager.1      flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-1  Running        Running about an hour ago
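
The placement in the listings above is easier to compare when the running tasks are counted per node. The following is a minimal sketch of that, assuming the whitespace-separated columns shown in the post (ID, name, image, node, desired state, current state); `running_tasks_per_node` is a hypothetical helper, not part of Docker or Flink, and the sample text is a shortened copy of three rows from the first listing.

```python
from collections import Counter


def running_tasks_per_node(ps_output: str) -> Counter:
    """Count tasks whose desired state is Running, grouped by node name."""
    counts = Counter()
    for line in ps_output.splitlines():
        # Drop the "\_" marker that docker prints in front of historical tasks,
        # so the column positions stay the same for every row.
        fields = [f for f in line.split() if f != "\\_"]
        # Assumed columns: ID, NAME, IMAGE, NODE, DESIRED STATE, CURRENT STATE ...
        if len(fields) >= 5 and fields[4] == "Running":
            counts[fields[3]] += 1
    return counts


# Three rows taken from the first listing in the post.
sample = """\
ktcy4ujro1yo  flink_taskmanager.1  flink:1.3.2-hadoop2-scala_2.10  azure-swarm-worker-2   Running  Running 4 seconds ago
wrb78346dvmg  flink_taskmanager.7  flink:1.3.2-hadoop2-scala_2.10  google-cloud-worker-1  Running  Running 5 seconds ago
m31bf1cenevj  flink_taskmanager.8  flink:1.3.2-hadoop2-scala_2.10  google-cloud-worker-1  Running  Running 5 seconds ago
"""

for node, count in running_tasks_per_node(sample).items():
    print(node, count)
```

Feeding it the full output of docker service ps flink_taskmanager at two points in time would show the google-cloud-worker-1 count dropping to zero while the Azure counts grow, which matches the two listings.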
    

    When I create the services using the create-docker-swarm-service.sh script from https://github.com/apache/flink/tree/master/flink-contrib/docker-flink, this is the Docker log:

    Starting Task Manager
    config file:
    jobmanager.rpc.address: flink-jobmanager
    jobmanager.rpc.port: 6123
    jobmanager.heap.mb: 1024
    taskmanager.heap.mb: 1024
    taskmanager.numberOfTaskSlots: 2
    taskmanager.memory.preallocate: false
    parallelism.default: 1
    jobmanager.web.port: 8081
    blob.server.port: 6124
    query.server.port: 6125
    Starting taskmanager as a console application on host c42a6093f7bb.
    2017-11-01 11:20:51,459 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2017-11-01 11:20:51,522 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
    2017-11-01 11:20:51,522 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 1.3.2, Rev:0399bee, Date:03.08.2017 @ 10:23:11 UTC)
    2017-11-01 11:20:51,522 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
    2017-11-01 11:20:51,522 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.141-b15
    2017-11-01 11:20:51,522 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 1024 MiBytes
    2017-11-01 11:20:51,522 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /docker-java-home/jre
    2017-11-01 11:20:51,526 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.2
    2017-11-01 11:20:51,526 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
    2017-11-01 11:20:51,526 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:+UseG1GC
    2017-11-01 11:20:51,526 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms1024M
    2017-11-01 11:20:51,526 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx1024M
    2017-11-01 11:20:51,526 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
    2017-11-01 11:20:51,526 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
    2017-11-01 11:20:51,526 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
    2017-11-01 11:20:51,526 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
    2017-11-01 11:20:51,527 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
    2017-11-01 11:20:51,527 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /opt/flink/conf
    2017-11-01 11:20:51,527 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /opt/flink/lib/flink-python_2.11-1.3.2.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.3.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.3.2.jar:::
    2017-11-01 11:20:51,527 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
    2017-11-01 11:20:51,528 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Registered UNIX signal handlers for [TERM, HUP, INT]
    2017-11-01 11:20:51,532 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 1048576
    2017-11-01 11:20:51,548 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /opt/flink/conf
    2017-11-01 11:20:51,551 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager
    2017-11-01 11:20:51,551 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
    2017-11-01 11:20:51,551 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.mb, 1024
    2017-11-01 11:20:51,551 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.mb, 1024
    2017-11-01 11:20:51,551 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 2
    2017-11-01 11:20:51,551 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.preallocate, false
    2017-11-01 11:20:51,552 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 1
    2017-11-01 11:20:51,552 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.web.port, 8081
    2017-11-01 11:20:51,552 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
    2017-11-01 11:20:51,553 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
    2017-11-01 11:20:51,560 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager
    2017-11-01 11:20:51,560 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
    2017-11-01 11:20:51,560 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.mb, 1024
    2017-11-01 11:20:51,560 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.mb, 1024
    2017-11-01 11:20:51,560 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 2
    2017-11-01 11:20:51,560 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.preallocate, false
    2017-11-01 11:20:51,561 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 1
    2017-11-01 11:20:51,561 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.web.port, 8081
    2017-11-01 11:20:51,561 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
    2017-11-01 11:20:51,561 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
    2017-11-01 11:20:51,585 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
    2017-11-01 11:20:51,621 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
    java.net.UnknownHostException: flink-jobmanager: Name or service not known
            at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
            at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
            at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
            at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
            at java.net.InetAddress.getAllByName(InetAddress.java:1192)
            at java.net.InetAddress.getAllByName(InetAddress.java:1126)
            at java.net.InetAddress.getByName(InetAddress.java:1076)
            at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.getRpcUrl(AkkaRpcServiceUtils.java:173)
            at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.getRpcUrl(AkkaRpcServiceUtils.java:138)
            at org.apache.flink.runtime.highavailability.HighAvailabilityServicesUtils.createHighAvailabilityServices(HighAvailabilityServicesUtils.java:78)
            at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1663)
            at org.apache.flink.runtime.taskmanager.TaskManager$$anon$2.call(TaskManager.scala:1574)
            at org.apache.flink.runtime.taskmanager.TaskManager$$anon$2.call(TaskManager.scala:1572)
            at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:422)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
            at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
            at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1572)
            at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
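
Both stack traces fail in InetAddress.getByName, which is a plain hostname lookup: the stack deployment looks up jobmanager and the script deployment looks up flink-jobmanager. The following is a minimal sketch of the same lookup outside Flink, intended to be run inside a task manager container on a failing node. It assumes a Python interpreter is available there, which the post does not confirm, and `can_resolve` is a hypothetical helper name.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the local resolver maps hostname to an IPv4 address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        # socket.gaierror, a subclass of OSError, is raised when the lookup fails.
        return False
```

Calling can_resolve("jobmanager") or can_resolve("flink-jobmanager") from a container on an Azure node and from one on the Google Cloud node would show whether the service name is missing only on the nodes where the task managers fail.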
    

0 answers:

No answers