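We have a Mesos cluster and launch tasks through Marathon on the Mesos slaves, using Docker containers.
The whole system runs very well, but occasionally we hit a very strange problem: when we try to destroy or redeploy a task through Marathon, the mesos-slave process itself gets killed as the target Docker container exits. This is the error log I got: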
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465544 4094 docker.cpp:1592] Executor for container 'eadfb756-b653-42eb-977a-c16c78b1a7c5' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465736 4094 docker.cpp:1390] Destroying container 'eadfb756-b653-42eb-977a-c16c78b1a7c5'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465812 4094 docker.cpp:1494] Running docker stop on container 'eadfb756-b653-42eb-977a-c16c78b1a7c5'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.466089 4098 slave.cpp:3440] Executor 'prod-xxxxxxx-data-collector-writer.6d832d68-d519-11e5-acca-00505692154c' of framework 8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000 exited with status 0
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.466167 4098 slave.cpp:3544] Cleaning up executor 'prod-xxxxxxx-data-collector-writer.6d832d68-d519-11e5-acca-00505692154c' of framework 8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: F0229 19:31:51.470055 4098 slave.cpp:3570] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: *** Check failure stack trace: ***
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c2144dd google::LogMessage::Fail()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c21621c google::LogMessage::SendToLog()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.566812 4099 docker.cpp:1592] Executor for container 'e2d9c750-88b7-4247-b696-6589665d6a66' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c2140cc google::LogMessage::Flush()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569646 4099 docker.cpp:1390] Destroying container 'e2d9c750-88b7-4247-b696-6589665d6a66'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569757 4099 docker.cpp:1592] Executor for container 'f51c68b8-c64d-47ea-a629-8516dcc90dba' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569787 4099 docker.cpp:1390] Destroying container 'f51c68b8-c64d-47ea-a629-8516dcc90dba'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569818 4099 docker.cpp:1494] Running docker stop on container 'e2d9c750-88b7-4247-b696-6589665d6a66'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569849 4099 docker.cpp:1494] Running docker stop on container 'f51c68b8-c64d-47ea-a629-8516dcc90dba'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c216b19 google::LogMessageFatal::~LogMessageFatal()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3bc99f2e mesos::internal::slave::Slave::removeExecutor()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3bcaca60 mesos::internal::slave::Slave::executorTerminated()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c1c6541 process::ProcessManager::resume()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c1c683f process::internal::schedule()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3ad4a1e0 (unknown)
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3afa3df5 start_thread
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3a7b41ad __clone
Feb 29 19:31:51 mesos-slave3.ourcompany.com systemd[1]: mesos-slave.service: main process exited, code=killed, status=6/ABRT
Feb 29 19:31:51 mesos-slave3.ourcompany.com systemd[1]: Unit mesos-slave.service entered failed state.
Feb 29 19:32:11 mesos-slave3.ourcompany.com systemd[1]: mesos-slave.service holdoff time over, scheduling restart.
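Several threads write to the journal at the same time, so the stack trace above is interleaved with unrelated INFO lines. Here is a minimal sketch, assuming the journal lines keep the format shown above, that pulls out only the fatal (F-level) message and the stack frames. The function name `extract_crash` and the regular expressions are my own illustration, not part of Mesos or glog.

```python
import re

# Syslog prefix ends with "mesos-slave[PID]: "; everything after is the glog payload.
PREFIX = re.compile(r"mesos-slave\[\d+\]: (?P<payload>.*)$")
# glog header: severity letter, MMDD, time, thread id, "file:line]", then the message.
GLOG = re.compile(
    r"^(?P<sev>[IWEF])\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ "
    r"(?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)
# Stack frame lines look like "@ 0x7f8c3c2144dd google::LogMessage::Fail()".
FRAME = re.compile(r"^@\s+0x[0-9a-f]+\s+(?P<symbol>.+)$")


def extract_crash(lines):
    """Return (fatal_messages, stack_symbols) found in mesos-slave journal lines."""
    fatal = []
    frames = []
    for line in lines:
        prefix_match = PREFIX.search(line)
        if not prefix_match:
            continue  # not a mesos-slave line (for example, systemd output)
        payload = prefix_match.group("payload")
        glog_match = GLOG.match(payload)
        if glog_match:
            if glog_match.group("sev") == "F":
                fatal.append((glog_match.group("src"), glog_match.group("msg")))
            continue
        frame_match = FRAME.match(payload)
        if frame_match:
            frames.append(frame_match.group("symbol"))
    return fatal, frames
```

Feeding it the log above (for example, the lines of a saved journal excerpt) should leave just the `slave.cpp:3570` check failure and the frames from `google::LogMessage::Fail()` down to `__clone`, in order.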
The task launched in the Docker container is an Akka application. The environment details for the whole system are:
OS:
CentOS Linux release 7.1.1503 (Core)
Kernel:
3.10.0-229.el7.x86_64
JDK on all machines:
java version "1.7.0_91"
OpenJDK Runtime Environment (rhel-2.6.2.1.el7_1-x86_64 u91-b00)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
Mesos:
0.25, installed by yum from mesosphere repo
Mesos-Master configuration:
--zk=zk://zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster --port=5050 --log_dir=/var/log/mesos --cluster=mesos-prod-cluster --hostname=<real hostname> --ip=<real ip> --quorum=3 --registry_fetch_timeout=5mins --work_dir=/var/lib/mesos
Mesos-Slave configuration:
--master=zk://zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster --log_dir=/var/log/mesos --attributes=env:prod --containerizers=docker,mesos --docker_remove_delay=2weeks --executor_registration_timeout=30mins --hostname=<real slave hostname>
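For anyone comparing the two flag lists above, here is a small sketch that parses a `--name=value` command line into a dictionary and reports which flags are set on one side but not the other. The helper names `parse_flags` and `only_in_first` are hypothetical, and the parser only assumes the `--name=value` shape seen in this post; it is not how Mesos itself reads its flags.

```python
import re

# A value runs until the next " --flag" or the end of the string,
# so placeholders containing spaces such as "<real slave hostname>" stay intact.
FLAG = re.compile(r"--([\w-]+)=(.*?)(?=\s+--|$)")


def parse_flags(command_line: str) -> dict:
    """Map each flag name to its value for a single-line '--name=value ...' string."""
    return {name: value for name, value in FLAG.findall(command_line)}


def only_in_first(first: dict, second: dict) -> list:
    """Return the sorted flag names present in `first` but missing from `second`."""
    return sorted(set(first) - set(second))
```

Calling `only_in_first(parse_flags(master_line), parse_flags(slave_line))` with the two lines above lists the flags that appear only in the master configuration, which makes it easier to check the flags that were not set explicitly on the slave.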
Marathon info:
{
"name": "marathon",
"version": "0.11.1",
"elected": true,
"leader": "<leader_ip>:8080",
"frameworkId": "8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000",
"marathon_config": {
"master": "zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster",
"failover_timeout": 604800,
"framework_name": "marathon",
"ha": true,
"checkpoint": true,
"local_port_min": 10000,
"local_port_max": 20000,
"executor": "//cmd",
"hostname": "<hostname>",
"webui_url": null,
"mesos_role": null,
"task_launch_timeout": 600000,
"reconciliation_initial_delay": 15000,
"reconciliation_interval": 300000,
"marathon_store_timeout": 2000,
"mesos_user": "root",
"leader_proxy_connection_timeout_ms": 5000,
"leader_proxy_read_timeout_ms": 10000,
"mesos_leader_ui_url": "http://<leader_ip>:5050/"
},
"zookeeper_config": {
"zk": "zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/marathon-cluster",
"zk_timeout": 10000,
"zk_session_timeout": 1800000,
"zk_max_versions": 25
},
"event_subscriber": {
"type": "http_callback",
"http_endpoints": null
},
"http_config": {
"assets_path": null,
"http_port": 8080,
"https_port": 8443
}
}
Docker version:
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:25:01 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:25:01 UTC 2015
OS/Arch: linux/amd64
Docker info:
Containers: 330
Images: 509
Server Version: 1.9.1
Storage Driver: devicemapper
Pool Name: docker-253:0-68977907-pool
Pool Blocksize: 65.54 kB
Base Device Size: 107.4 GB
Backing Filesystem:
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 23.68 GB
Data Space Total: 107.4 GB
Data Space Available: 27.51 GB
Metadata Space Used: 63.75 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.084 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-229.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 4
Total Memory: 15.67 GiB
Name: mesos-slave3.gz.yougola.com
ID: QB4G:C2HK:CBPR:G5ID:6OCU:DFEC:USBP:ECLQ:FWOQ:ZGHS:JIU5:JNN4
The Docker, Mesos-Master, Mesos-Slave, and Marathon services are all managed by systemd.
Answer 0 (score: 2)
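This is strange and unfortunate. It looks like the slave is failing this check: https://github.com/apache/mesos/blob/0.25.0/src/slave/slave.cpp#L3570 because it cannot find the path for the executor sentinel file.
Could you file a new JIRA at https://issues.apache.org/jira/browse/MESOS so that we can track and resolve this for you?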
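To illustrate the failure mode in general terms: touching a file fails with "No such file or directory" when the directory that should contain it no longer exists, which matches the message in the fatal log line. The sketch below reproduces that with plain Python. The function `touch_sentinel`, the directory layout, and the file name used here are stand-ins for illustration only; this is not the Mesos code, and it does not say why the directory was missing in this cluster.

```python
import errno
import shutil
import tempfile
from pathlib import Path


def touch_sentinel(run_dir: Path) -> None:
    """Illustrative stand-in for touching a sentinel file inside a run directory."""
    (run_dir / "executor.sentinel").touch()


def demo() -> bool:
    """Return True if touching fails with ENOENT once the run directory is gone."""
    with tempfile.TemporaryDirectory() as root:
        run_dir = Path(root) / "runs" / "container-id"
        run_dir.mkdir(parents=True)
        touch_sentinel(run_dir)  # works while the directory exists

        shutil.rmtree(run_dir)   # simulate the directory disappearing
        try:
            touch_sentinel(run_dir)
        except FileNotFoundError as error:
            return error.errno == errno.ENOENT
    return False
```

Because the slave wraps the touch in a `CHECK_SOME`, a failure like this aborts the whole process instead of being handled, which is consistent with the `status=6/ABRT` reported by systemd. When filing the JIRA, it would help to note whether the executor's directory under the slave's work directory still existed at the time of the crash.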