How to troubleshoot failed Docker tasks

Time: 2018-10-05 16:35:29

Tags: docker docker-compose docker-swarm

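I am only starting out in the Docker world, and many (basic) principles of how everything is organized are still unclear to me. Please help me understand how to deal with failing Docker tasks.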

My Docker service does not work properly, but that is a secondary problem. The main problem is that it is unclear how to troubleshoot it.

This is my docker-compose file; it consists of an application service and a mongodb service. The application service writes its logs to /opt/myapp/log/app.log. The complete sources can be found here. I also built a ...

Let's start the stack:

docker swarm init
Swarm initialized: current node (xpkngdn0vpr73nioalzbkem1k) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-6109nv6pn7eb9gtam8bq4m198k5sk7ztzf7hy7yfv5c47kcrmq-9fbrmmccd977kx22mivs7segn 192.168.65.3:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

docker stack deploy -c docker-compose.yml myapp
Creating network myapp_default
Creating service myapp_web
Creating service myapp_db

After that, let's wait a little (~1 minute) and continue:

docker ps -a
CONTAINER ID        IMAGE                                              COMMAND                  CREATED              STATUS                        PORTS               NAMES
31e9f3a8f5aa        deniszhdanov/docker-swarm-troobleshoot-service:1   "java -jar /opt/myap…"   31 seconds ago       Up 25 seconds                 8090/tcp            myapp_web.1.sij3z7cbbynsxos6608ru2f8a
9fc6a5868c12        mongo:latest                                       "docker-entrypoint.s…"   About a minute ago   Up About a minute             27017/tcp           myapp_db.1.gxl5xwj1tg80nr16clbskk2oc
a3ff2ba0c8c5        deniszhdanov/docker-swarm-troobleshoot-service:1   "java -jar /opt/myap…"   About a minute ago   Exited (137) 32 seconds ago                       myapp_web.1.3dv8x2dx6kig4qkf1wc2axro8

We can see that there is a failed task. Let's try to understand what went wrong:

docker commit a3ff2ba0c8c5 snapshot
sha256:bec4756cadebbada400b4d1037cac671168396bf73b7d3e875c6f98f63522afd

docker run --rm -it snapshot /bin/sh
/opt/myapp # cat /opt/myapp/log/app.log
2018-10-05 15:34:20 - Starting Start on a3ff2ba0c8c5 with PID 1 (/opt/myapp/lib/myapp.jar started by root in /opt/myapp)
2018-10-05 15:34:21 - No active profile set, falling back to default profiles: default

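The task's log does not contain enough information to troubleshoot the problem. However, that information is available when we run the application image standalone: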

docker run --rm -d deniszhdanov/docker-swarm-troobleshoot-service:1
825a818b425feb7ed1f593c14a411efb68457aee9c6bfcf27f745fd58cfa0001

docker ps
CONTAINER ID        IMAGE                                              COMMAND                  CREATED             STATUS              PORTS               NAMES
825a818b425f        deniszhdanov/docker-swarm-troobleshoot-service:1   "java -jar /opt/myap…"   46 seconds ago      Up 44 seconds       8090/tcp            zen_bardeen

docker exec -it 825a818b425f /bin/sh
/opt/myapp # cat /opt/myapp/log/app.log
2018-10-05 15:30:09 - Starting Start on 825a818b425f with PID 1 (/opt/myapp/lib/myapp.jar started by root in /opt/myapp)
2018-10-05 15:30:09 - No active profile set, falling back to default profiles: default
2018-10-05 15:30:09 - Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@18ef96: startup date [Fri Oct 05 15:30:09 GMT 2018]; root of context hierarchy
2018-10-05 15:30:10 - Cluster created with settings {hosts=[db:27017], mode=MULTIPLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
2018-10-05 15:30:10 - Adding discovered server db:27017 to client view of cluster
2018-10-05 15:30:11 - Exception in monitor thread while connecting to server db:27017
com.mongodb.MongoSocketException: db: Name does not resolve
    at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:188)
    at com.mongodb.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:59)
    at com.mongodb.connection.SocketStream.open(SocketStream.java:57)
    at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:126)
    at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:114)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: db: Name does not resolve
    at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at java.net.InetAddress.getByName(InetAddress.java:1076)
    at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:186)
    ... 5 common frames omitted

Well, thanks to everyone who managed to get to this point :)

Questions:

  1. Why is the state different for a standalone container and a task?
  2. What is the recommended way to troubleshoot failed tasks?

Regards, Denis

1 answer:

Answer 0 (score: 0)

Eventually I found the docker troubleshooting page, checked the Docker logs and found this:

2018-10-06 11:58:29.662461+0800  localhost com.docker.hyperkit[583]: [91168.810550] CPU: 3 PID: 50578 Comm: java Not tainted 4.9.93-linuxkit-aufs #1
2018-10-06 11:58:29.663013+0800  localhost com.docker.hyperkit[583]: [91168.811356] Hardware name:   BHYVE, BIOS 1.00 03/14/2014
2018-10-06 11:58:29.663984+0800  localhost com.docker.hyperkit[583]: [91168.811909]  0000000000000000 ffffffffa243922a ffffbb9dc0763de8 ffff937eb67aed00
2018-10-06 11:58:29.664792+0800  localhost com.docker.hyperkit[583]: [91168.812878]  ffffffffa21f5d85 0000000000000000 0000000000000000 ffffbb9dc0763de8
2018-10-06 11:58:29.665681+0800  localhost com.docker.hyperkit[583]: [91168.813694]  ffff937e6df576a0 0000000000000202 ffffffffa27f9dae ffff937eb67aed00
2018-10-06 11:58:29.666082+0800  localhost com.docker.hyperkit[583]: [91168.814585] Call Trace:
2018-10-06 11:58:29.666638+0800  localhost com.docker.hyperkit[583]: [91168.814984]  [<ffffffffa243922a>] ? dump_stack+0x5a/0x6f
2018-10-06 11:58:29.667238+0800  localhost com.docker.hyperkit[583]: [91168.815534]  [<ffffffffa21f5d85>] ? dump_header+0x78/0x1ed
2018-10-06 11:58:29.667980+0800  localhost com.docker.hyperkit[583]: [91168.816144]  [<ffffffffa27f9dae>] ? _raw_spin_unlock_irqrestore+0x16/0x18
2018-10-06 11:58:29.668686+0800  localhost com.docker.hyperkit[583]: [91168.816878]  [<ffffffffa21a1f90>] ? oom_kill_process+0x83/0x324
2018-10-06 11:58:29.669273+0800  localhost com.docker.hyperkit[583]: [91168.817583]  [<ffffffffa21a25b7>] ? out_of_memory+0x239/0x267
2018-10-06 11:58:29.669944+0800  localhost com.docker.hyperkit[583]: [91168.818162]  [<ffffffffa21ef2cd>] ? mem_cgroup_out_of_memory+0x4b/0x79
2018-10-06 11:58:29.670652+0800  localhost com.docker.hyperkit[583]: [91168.818834]  [<ffffffffa21f34a6>] ? mem_cgroup_oom_synchronize+0x26b/0x294
2018-10-06 11:58:29.671338+0800  localhost com.docker.hyperkit[583]: [91168.819560]  [<ffffffffa21ef650>] ? mem_cgroup_is_descendant+0x48/0x48
2018-10-06 11:58:29.671982+0800  localhost com.docker.hyperkit[583]: [91168.820253]  [<ffffffffa21a2612>] ? pagefault_out_of_memory+0x2d/0x6f
2018-10-06 11:58:29.672615+0800  localhost com.docker.hyperkit[583]: [91168.820886]  [<ffffffffa20459b0>] ? __do_page_fault+0x3c6/0x45f
2018-10-06 11:58:29.673158+0800  localhost com.docker.hyperkit[583]: [91168.821516]  [<ffffffffa27fb3c8>] ? page_fault+0x28/0x30
2018-10-06 11:58:29.677517+0800  localhost com.docker.hyperkit[583]: [91168.822159] Task in /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235 killed as a result of limit of /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235
2018-10-06 11:58:29.678200+0800  localhost com.docker.hyperkit[583]: [91168.826457] memory: usage 51188kB, limit 51200kB, failcnt 8764
2018-10-06 11:58:29.678913+0800  localhost com.docker.hyperkit[583]: [91168.827160] memory+swap: usage 102400kB, limit 102400kB, failcnt 78
2018-10-06 11:58:29.679469+0800  localhost com.docker.hyperkit[583]: [91168.827815] kmem: usage 884kB, limit 9007199254740988kB, failcnt 0
2018-10-06 11:58:29.681923+0800  localhost com.docker.hyperkit[583]: [91168.828391] Memory cgroup stats for /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235: cache:20KB rss:50284KB rss_huge:0KB mapped_file:4KB dirty:8KB writeback:0KB swap:51212KB inactive_anon:25264KB active_anon:25020KB inactive_file:8KB active_file:8KB unevictable:0KB
2018-10-06 11:58:29.682630+0800  localhost com.docker.hyperkit[583]: [91168.830820] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
2018-10-06 11:58:29.683423+0800  localhost com.docker.hyperkit[583]: [91168.831586] [50168]     0 50168   499844    14545      94       5    12964             0 java
2018-10-06 11:58:29.684186+0800  localhost com.docker.hyperkit[583]: [91168.832329] Memory cgroup out of memory: Kill process 50168 (java) score 1078 or sacrifice child
2018-10-06 11:58:29.685451+0800  localhost com.docker.hyperkit[583]: [91168.833172] Killed process 50168 (java) total-vm:1999376kB, anon-rss:49108kB, file-rss:9072kB, shmem-rss:0kB
2018-10-06 11:58:29.970702+0800  localhost com.docker.hyperkit[583]: [91169.119073] oom_reaper: reaped process 50168 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2018-10-06 11:58:30.206071+0800  localhost com.docker.driver.amd64-linux[579]: osxfs: die event: de-registering container bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235
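
The lines that matter in a dump like this are the cgroup accounting ones: "memory: usage 51188kB, limit 51200kB" shows that the container had reached a memory limit of 51200 kB (50 MiB). As a minimal sketch, those figures could be pulled out of a saved copy of the log with a small shell helper. The function name oom_memory_usage is hypothetical, and the pattern assumes the kernel message format shown above:

```shell
# Hypothetical helper: print "usage_kB limit_kB" for every cgroup
# "memory: usage ..." line read from stdin. The "memory+swap" lines are
# skipped because the pattern requires the text "] memory: usage".
oom_memory_usage() {
  sed -n 's/.*] memory: usage \([0-9]*\)kB, limit \([0-9]*\)kB.*/\1 \2/p'
}

# Sample input: the relevant line copied from the log above.
line='2018-10-06 11:58:29.678200+0800  localhost com.docker.hyperkit[583]: [91168.826457] memory: usage 51188kB, limit 51200kB, failcnt 8764'

# Report usage and limit, converting the limit from kB to MiB.
printf '%s\n' "$line" | oom_memory_usage | while read -r usage limit; do
  echo "usage ${usage} kB, limit ${limit} kB ($((limit / 1024)) MiB)"
done
```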

In other words, the cause was that I had inadvertently copied the resources / limits / memory setting from one of the docker-compose tutorials, so Docker kept killing my application.
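
This is consistent with the Exited (137) status in the question's docker ps -a output. By convention, an exit status above 128 means the process was terminated by a signal whose number is the status minus 128, and 137 - 128 = 9 is SIGKILL, the signal sent by the kernel OOM killer. A minimal sketch of that decoding follows; the function name decode_exit_status is hypothetical:

```shell
# Hypothetical helper: decode a container exit status. Values above 128
# conventionally mean "terminated by signal (status - 128)".
decode_exit_status() {
  status=$1
  if [ "$status" -gt 128 ]; then
    signal=$((status - 128))
    # kill -l <number> prints the signal name without the SIG prefix.
    echo "killed by signal $signal ($(kill -l "$signal"))"
  else
    echo "exited with status $status"
  fi
}

decode_exit_status 137   # prints: killed by signal 9 (KILL)
```

A SIGKILL alone does not prove an out-of-memory kill, since anything can send it, so it is a hint to go and check memory limits rather than a diagnosis.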

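In the end the problem turned out to be trivial, and the troubleshooting is not really a burden either: one only needs to look at the Docker daemon logs. It was only because of my inexperience with Docker that I spent an evening trying to find the root cause from within the Docker containers/swarm (service logs, container logs, etc.). Well, practice makes perfect :)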
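
For a similar situation, two quicker checks may be worth trying before reading the daemon logs: docker service ps --no-trunc lists each task of a service with its state and error columns, and docker inspect exposes a State.OOMKilled flag for a container. The sketch below wraps the second check. The function name was_oom_killed is hypothetical, and whether the flag is set in every out-of-memory case is an assumption, not something verified against this stack:

```shell
# Hypothetical helper: report whether Docker recorded an out-of-memory kill
# for the given container. The container may be exited, but it must still
# exist (i.e. it has not been removed).
was_oom_killed() {
  flag=$(docker inspect --format '{{.State.OOMKilled}}' "$1") || return 1
  if [ "$flag" = "true" ]; then
    echo "OOM-killed"
  else
    echo "not OOM-killed"
  fi
}

# Usage, with the container ID and service name from the question:
#   was_oom_killed a3ff2ba0c8c5
#   docker service ps --no-trunc myapp_web
```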