Docker Engine fails on Azure Batch nodes

Time: 2017-10-12 11:03:00

Tags: azure docker azure-batch docker-engine

Scenario

I created a pool with several nodes (the base image is Ubuntu Server 16.04) and supplied the following start command:

/bin/bash -c 'set -o pipefail; export DEBIAN_FRONTEND=noninteractive ; sudo -E apt update ; sudo -E apt upgrade -y ; sudo -E apt-get install -y --no-install-recommends apt-transport-https curl software-properties-common ; curl -fsSL "https://sks-keyservers.net/pks/lookup?op=get&search=0xee6d536cf7dc86e2d7d56f59a178ac6c6238f52e" | sudo -E apt-key add - ; sudo -E apt-add-repository "deb https://packages.docker.com/1.13/apt/repo/ ubuntu-$(lsb_release -cs) main" ; sudo -E apt-get update ; sudo -E apt-get install -y docker-engine ; sudo usermod -a -G docker $USER ; sudo -E service docker start ; journalctl -xe; wait'

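The sole purpose of this command is to install Docker Engine. Note also that I removed the set -e option so that journalctl -xe would still run and capture the error shown below.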
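As a side note on dropping set -e: one way to keep fail-fast behavior and still collect the journal is to wrap only the step that may fail. The following is a minimal sketch, not part of the original start command; the helper name run_or_dump is hypothetical, and it assumes bash is the shell running the start task.

```shell
#!/bin/bash

# Hypothetical helper (not from the original post): run a command and,
# only if it fails, run a diagnostic command, then return the original
# exit status of the command.
run_or_dump() {
    local diag="$1" status=0
    shift
    # "|| status=$?" records the failure without letting set -e (if the
    # caller enabled it) abort before the diagnostics have run.
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        # Diagnostics are written to stderr; a failure of the diagnostic
        # command itself is ignored.
        bash -c "$diag" >&2 || true
    fi
    return "$status"
}
```

In the start command this could be used as run_or_dump 'journalctl -xe --no-pager' sudo -E service docker start, so that the journal is printed only when the Docker service fails to start and the task still exits with a non-zero status.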

Error

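When I create the pool described above, some nodes fail in the start task. The behavior appears to be random: it is not always the same nodes that fail, and, as noted, the other nodes start without problems. The behavior does not depend on node size (I tried D2_v3 and NC6).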

This is the output of journalctl -xe:

Oct 12 09:19:40 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: Listening on Docker Socket for the API.
-- Subject: Unit docker.socket has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker.socket has finished starting up.
-- 
-- The start-up result is done.
Oct 12 09:19:40 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: Starting Docker Application Container Engine...
-- Subject: Unit docker.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker.service has begun starting up.
Oct 12 09:19:40 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:40.605332263Z" level=info msg="libcontainerd: new containerd process, pid: 24492"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.608293321Z" level=info msg="[graphdriver] using prior storage driver: aufs"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626089049Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626378756Z" level=warning msg="Your kernel does not support swap memory limit"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626558660Z" level=warning msg="Your kernel does not support cgroup rt period"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626698864Z" level=warning msg="Your kernel does not support cgroup rt runtime"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626834867Z" level=warning msg="Your kernel does not support cgroup blkio weight"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.626970070Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.627384080Z" level=info msg="Loading containers: start."
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.630900065Z" level=info msg="Firewalld running: false"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.661877309Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A kernel: IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
Oct 12 09:19:41 7d8bb094c57c400582f6031d59f1630000000A dockerd[24473]: time="2017-10-12T09:19:41.996853856Z" level=info msg="Loading containers: done."
Oct 12 09:19:42 7d8bb094c57c400582f6031d59f1630000000A kernel: aufs au_opts_verify:1585:dockerd[24490]: dirperm1 breaks the protection by the permission bits on the lower branch
Oct 12 09:19:45 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: docker.service: Main process exited, code=killed, status=11/SEGV
Oct 12 09:19:45 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker.service has failed.
-- 
-- The result is failed.
Oct 12 09:19:45 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: docker.service: Unit entered failed state.
Oct 12 09:19:45 7d8bb094c57c400582f6031d59f1630000000A systemd[1]: docker.service: Failed with result 'signal'.

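Something seems to go wrong while the network interface is being created, but I am not sure what it is, and especially not how to fix it.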

1 Answer:

Answer 0: (score: 0)

Updated answer, 2017-10-18:

This issue has been fixed in the latest platform image of Canonical UbuntuServer 16.04-LTS, which works with Go / Docker again.

Original answer:

There is nothing wrong with your code. There is an issue between the Canonical UbuntuServer 16.04-LTS 201709190 platform image (which is also latest at this time) and Go / Docker.

Until the issue is resolved, temporarily set the image version to 201708151.
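As a rough illustration of pinning the image version, the sketch below builds an image reference in the publisher:offer:sku:version form. The helper name pinned_image_ref is hypothetical, and the version value is copied verbatim from this answer; the exact version string the service expects may differ (for example, it may carry a prefix), so verify it against the published image versions before relying on it.

```shell
#!/bin/bash

# Hypothetical helper: print an image reference in the
# publisher:offer:sku:version form, pinned to an explicit version
# instead of "latest".
pinned_image_ref() {
    local publisher="Canonical" offer="UbuntuServer" sku="16.04-LTS"
    # Version taken verbatim from the answer; pass another value as $1
    # to override it. The exact format is an assumption.
    local version="${1:-201708151}"
    printf '%s:%s:%s:%s\n' "$publisher" "$offer" "$sku" "$version"
}

# Possible use with the Azure CLI (not executed here; the pool id, VM size,
# and node count are hypothetical placeholders):
#   az batch pool create --id docker-pool --vm-size Standard_D2_v3 \
#       --target-dedicated-nodes 2 \
#       --node-agent-sku-id "batch.node.ubuntu 16.04" \
#       --image "$(pinned_image_ref)"
```

Once the fixed platform image is available, the explicit version can be dropped again so the pool returns to latest.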

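By the way: if you are using Docker with Azure Batch, you should take a look at Batch Shipyard, which provides this functionality. (Full disclosure: I am a contributor to this code.)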