Nomad job hangs due to Docker I/O timeouts on 3 of 4 nodes?

Time: 2019-04-26 15:09:18

Tags: amazon-web-services docker consul aws-ecr nomad

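My job ends up stuck in a perpetual pending state because the docker pull of the container I want keeps hitting I/O timeouts. I have read several times that changing the DNS settings fixes this, but that looks like a red herring to me: I shouldn't need Google's public DNS addresses on a private network... Here is the job status after submitting ping-services.nomad: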

○ → nomad job status ping_service
ID            = ping_service
Name          = ping_service
Submit Date   = 2019-04-25T13:29:04-07:00
Type          = service
Priority      = 50
Datacenters   = public-services,private-services,content-connector,backoffice
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
ping_service_group  0       3         1        0       4         0

Allocations
ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
05468ff2  23b79904  ping_service_group  2        run      pending  18h28m ago  19s ago      <- here
5ce4c9ba  1601d6b1  ping_service_group  2        run      pending  18h28m ago  20s ago      <- here
9eced817  2260997a  ping_service_group  2        run      running  18h28m ago  18h28m ago
aefab4c3  032217e1  ping_service_group  2        run      pending  18h28m ago  42s ago      <- and here
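
To see at a glance which nodes hold the stuck allocations, the same listing can be pulled from Nomad's HTTP API and grouped by client status. This is only a sketch: it assumes the agent is reachable at the default http://127.0.0.1:4646 without ACL tokens or TLS, and the helper names fetch_allocations and nodes_by_status are made up for illustration.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_allocations(job_id, addr="http://127.0.0.1:4646"):
    """Fetch the allocation list for a job from the Nomad HTTP API.

    The address is an assumption (Nomad's default); adjust for your cluster.
    """
    with urlopen(f"{addr}/v1/job/{job_id}/allocations", timeout=10) as resp:
        return json.load(resp)


def nodes_by_status(allocs):
    """Group short node IDs by the allocation's ClientStatus."""
    grouped = defaultdict(list)
    for alloc in allocs:
        grouped[alloc["ClientStatus"]].append(alloc["NodeID"][:8])
    return dict(grouped)


# Usage (needs a reachable agent, so it is left commented out):
# print(nodes_by_status(fetch_allocations("ping_service")))
```

With the four allocations listed above, this would put 23b79904, 1601d6b1 and 032217e1 under pending and 2260997a under running.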

Only 1 of the 4 allocations is running. Here is nomad alloc status 05468ff2 for one of the pending ones:
○ → nomad alloc status 05468ff2
ID                  = 05468ff2
Eval ID             = 10b76231
Name                = ping_service.ping_service_group[1]
Node ID             = 23b79904
Job ID              = ping_service
Job Version         = 2
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 18h35m ago
Modified            = 15s ago

Task "ping_service_task" is "pending"
Task Resources
CPU      Memory  Disk    IOPS  Addresses
100 MHz  20 MiB  50 MiB  0     http: xx.xxx.xxx.xxx:31215

Task Events:
Started At     = N/A
Finished At    = N/A
Total Restarts = 982
Last Restart   = 2019-04-26T15:04:01Z

Recent Events:
Time                       Type            Description
2019-04-26T08:04:28-07:00  Driver          Downloading image thobe/ping_service:0.0.9
2019-04-26T08:04:01-07:00  Restarting      Task restarting in 27.061915977s
2019-04-26T08:04:01-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556294011-ftjrcDBBZK4hiQV99v5QZXxvp34%3D: dial tcp 104.18.122.25:443: i/o timeout
2019-04-26T08:03:19-07:00  Driver          Downloading image thobe/ping_service:0.0.9
2019-04-26T08:02:51-07:00  Restarting      Task restarting in 27.302069343s
2019-04-26T08:02:51-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293941-ZUevnKxoKohkLDGDkv5E4A79aZ8%3D: dial tcp 104.18.122.25:443: i/o timeout
2019-04-26T08:02:12-07:00  Driver          Downloading image thobe/ping_service:0.0.9
2019-04-26T08:01:46-07:00  Restarting      Task restarting in 25.629825445s
2019-04-26T08:01:46-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293876-lE4pvy9Jsruduu76LeMoQxL0gxk%3D: dial tcp 104.18.123.25:443: i/o timeout
2019-04-26T08:01:07-07:00  Driver          Downloading image thobe/ping_service:0.0.9
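
The failures above are not all against the same address, which is easy to miss in the long lines. A small sketch that tallies the timed-out destinations from the event text is shown here; the function name tally_timeouts is hypothetical, and the sample strings are the three Driver Failure events with the long blob URL trimmed out.

```python
import re
from collections import Counter

# Matches the tail of the Docker error, e.g.
# "dial tcp 104.18.122.25:443: i/o timeout"
DIAL_TIMEOUT = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+): i/o timeout")


def tally_timeouts(lines):
    """Count i/o timeouts per (ip, port) destination found in the lines."""
    counts = Counter()
    for line in lines:
        for ip, port in DIAL_TIMEOUT.findall(line):
            counts[(ip, int(port))] += 1
    return counts


# The three Driver Failure events above, with the blob URL trimmed.
events = [
    "Failed to pull `thobe/ping_service:0.0.9`: dial tcp 104.18.122.25:443: i/o timeout",
    "Failed to pull `thobe/ping_service:0.0.9`: dial tcp 104.18.122.25:443: i/o timeout",
    "Failed to pull `thobe/ping_service:0.0.9`: dial tcp 104.18.123.25:443: i/o timeout",
]

for (ip, port), n in tally_timeouts(events).items():
    print(f"{ip}:{port} timed out {n} time(s)")
```

Timeouts spread across several registry addresses point more toward outbound connectivity from the node than toward one bad endpoint, though three events are a small sample.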

You can clearly see that the problem is an I/O timeout that prevents us from pulling our layers, so let's hop onto the node and try it manually...

## Make sure we're really logged into ECR/Docker
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ docker login
Authenticating with existing credentials...
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

## Attempt a manual pull... 
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ docker pull thobe/ping_service:0.0.9
0.0.9: Pulling from thobe/ping_service
ff3a5c916c92: Pulling fs layer
3c5613eb8e39: Pulling fs layer
error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293601-mrJGlZisGPDvwapT7cAbax7UWig%3D: dial tcp 104.18.125.25:443: i/o timeout

## Are you there God?
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ ping -c1 production.cloudflare.docker.com
PING production.cloudflare.docker.com (104.18.123.25) 56(84) bytes of data.

--- production.cloudflare.docker.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
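
One caveat on the ping result: ping sends ICMP echo requests, which can be filtered independently of TCP, while the failing pull is a TCP dial to port 443. A check that mirrors the failing operation more closely is a plain TCP connect with a timeout. The sketch below uses only the Python standard library; the function names tcp_probe and probe_all are hypothetical.

```python
import socket


def tcp_probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and resolution failures.
        return False


def probe_all(hosts, port, timeout=5.0):
    """Probe every host on the same port and return {host: reachable}."""
    return {host: tcp_probe(host, port, timeout) for host in hosts}


# Usage on a node (left commented out because it needs network access):
# print(probe_all(["104.18.122.25", "104.18.123.25"], 443))
```

If the same probe succeeds on the good node and fails on the bad ones, the difference is likely in routing or security rules for outbound 443 rather than in Docker or Nomad.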


## NS of Google Pub DNS
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 8.8.8.8
;; connection timed out; no servers could be reached

## NS of Primary nameserver
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 10.128.8.8
;; connection timed out; no servers could be reached

## NS of Secondary nameserver
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 10.128.0.2
Server:     10.128.0.2
Address:    10.128.0.2#53

Non-authoritative answer:
Name:   production.cloudflare.docker.com
Address: 104.18.122.25
Name:   production.cloudflare.docker.com
Address: 104.18.123.25
Name:   production.cloudflare.docker.com
Address: 104.18.124.25
Name:   production.cloudflare.docker.com
Address: 104.18.125.25
Name:   production.cloudflare.docker.com
Address: 104.18.121.25

## Resolver
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ cat /etc/resolv.conf
options timeout:2 attempts:5
; generated by /usr/sbin/dhclient-script
search nomad-eu-west-1 eu-west-1.compute.internal
nameserver 10.128.8.8
nameserver 10.128.0.2
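
The nslookup runs above suggest that the first listed nameserver (10.128.8.8) does not answer while the second (10.128.0.2) does. As far as I understand the resolver, servers are tried in the order listed, so every lookup may first wait out the timeout on the dead one. Below is a rough sketch for checking each configured nameserver directly with a hand-built DNS A query over UDP, using only the standard library. All function names are hypothetical, and the packet handling is deliberately minimal (it only confirms that some reply with the matching ID came back).

```python
import socket
import struct


def parse_resolv_conf(text):
    """Return (nameservers, options) from resolv.conf-style text."""
    nameservers, options = [], {}
    for raw in text.splitlines():
        # Drop ';' and '#' comments, then surrounding whitespace.
        line = raw.split(";", 1)[0].split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            for opt in fields[1:]:
                name, _, value = opt.partition(":")
                options[name] = int(value) if value.isdigit() else True
    return nameservers, options


def build_a_query(name, query_id):
    """Build a minimal DNS query packet for an A record (recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def nameserver_answers(server, name, timeout=2.0):
    """Return True if the server sends back any reply to an A query in time."""
    query = build_a_query(name, 0x1234)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            return False
    return len(reply) >= 2 and reply[:2] == query[:2]


# Usage on a node (left commented out because it needs network access):
# with open("/etc/resolv.conf") as fh:
#     servers, opts = parse_resolv_conf(fh.read())
# for server in servers:
#     print(server, nameserver_answers(server, "production.cloudflare.docker.com"))
```

Note that the pull errors already show resolved addresses, so name resolution does eventually succeed here; a dead first nameserver would add delay to every lookup but would not by itself explain the TCP timeouts.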

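Something seems to be going on with the bad nodes (that is, the ones that cannot pull). Notice the event saying the Docker driver was not detected? I only spotted it while looking at the node events on one of the bad nodes...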

○ → nomad node status 23b79904
ID            = 23b79904
Name          = i-xxxxxxx
Class         = <none>
DC            = public-services
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 21h43m20s
Driver Status = docker,exec

Node Events
Time                  Subsystem       Message
2019-04-25T20:39:48Z  Driver: docker  Driver is available and responsive
2019-04-25T20:39:03Z  Driver: docker  Driver docker is not detected
2019-04-25T18:06:53Z  Cluster         Node registered

Allocated Resources
CPU           Memory           Disk            IOPS
500/2399 MHz  128 MiB/983 MiB  300 MiB/48 GiB  0/0

Allocation Resource Utilization
CPU         Memory
5/2399 MHz  14 MiB/983 MiB

Host Resource Utilization
CPU          Memory           Disk
24/2399 MHz  410 MiB/984 MiB  1.8 GiB/50 GiB

Allocations
ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
05468ff2  23b79904  ping_service_group  2        run      pending  19h19m ago  33s ago
9f9ecba6  23b79904  fabio               0        run      running  21h33m ago  21h32m ago

The good node, below...

○ → nomad node status 2260997a
ID            = 2260997a
Name          = i-xxxxxxxxx
Class         = <none>
DC            = content-connector
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 21h43m28s
Driver Status = docker,exec

Node Events
Time                  Subsystem  Message
2019-04-25T18:07:04Z  Cluster    Node registered

Allocated Resources
CPU           Memory          Disk           IOPS
100/2400 MHz  20 MiB/983 MiB  50 MiB/48 GiB  0/0

Allocation Resource Utilization
CPU         Memory
0/2400 MHz  6.1 MiB/983 MiB

Host Resource Utilization
CPU          Memory           Disk
23/2400 MHz  361 MiB/984 MiB  1.8 GiB/50 GiB

Allocations
ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
9eced817  2260997a  ping_service_group  2        run      running  19h19m ago  19h19m ago

Nomad version below:

[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nomad -v
Nomad v0.8.6 (ab54ebcfcde062e9482558b7c052702d4cb8aa1b+CHANGES)

0 answers:

No answers yet.