ArangoDB DCOS:协调员在写入数据时死亡

时间:2017-01-09 16:15:04

标签: cluster-computing mesos arangodb dcos

我已经建立了一个DCOS集群并安装了arangodb mesos框架(我还没有改变初始配置)。我可以通过arangodb-proxy访问端口8529上的Web界面,我可以在那里创建数据库,集合和文档。

现在我尝试使用java驱动程序(3.1.4)导入一些数据。在2-3次通话后,协调员发生故障。 Mesos重新启动它,但只要我发送数据,它会在几次请求后立即再次死亡(我也会丢失web界面上的连接几秒钟):

com.arangodb.ArangoException: org.apache.http.NoHttpResponseException: 172.16.100.99:8529 failed to respond

我的插入基本上只是一个创建语句:

arangoDriver.graphCreateVertex(GRAPH_NAME, VERTEX_COLLECTION,
                            getId(), this, true);

ArangoDB代理也抱怨:

I0109 11:26:45.046947 113285 exec.cpp:161] Version: 1.0.1
I0109 11:26:45.051712 113291 exec.cpp:236] Executor registered on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 11:26:45.052942 113293 docker.cpp:815] Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 -e MARATHON_APP_VERSION=2017-01-09T10:26:29.819Z -e HOST=172.16.100.99 -e MARATHON_APP_RESOURCE_CPUS=1.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e MARATHON_APP_DOCKER_IMAGE=arangodb/arangodb-mesos-haproxy -e MESOS_TASK_ID=arangodb-proxy.16604c72-d656-11e6-80d4-70b3d5800001 -e PORT=8529 -e MARATHON_APP_RESOURCE_MEM=128.0 -e PORTS=8529 -e MARATHON_APP_RESOURCE_DISK=0.0 -e PORT_80=8529 -e MARATHON_APP_LABELS= -e MARATHON_APP_ID=/arangodb-proxy -e PORT0=8529 -e LIBPROCESS_IP=172.16.100.99 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.e0db1925-ff85-4454-bd7e-e0f46e502631 -v /var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0000/executors/arangodb-proxy.16604c72-d656-11e6-80d4-70b3d5800001/runs/e0db1925-ff85-4454-bd7e-e0f46e502631:/mnt/mesos/sandbox --net bridge -p 8529:80/tcp --entrypoint /bin/sh --name mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.e0db1925-ff85-4454-bd7e-e0f46e502631 arangodb/arangodb-mesos-haproxy -c nodejs /configurator.js arangodb3
{ [Error: connect ECONNREFUSED 172.16.100.98:1891]
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '172.16.100.98',
  port: 1891 }
{ [Error: connect ECONNREFUSED 172.16.100.99:10413]
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '172.16.100.99',
  port: 10413 }

我可以在arangodb服务完成任务列表中看到失败的任务,但是stderr日志似乎没有说什么:


I0109 16:28:31.792980 126177 exec.cpp:161] Version: 1.0.1
I0109 16:28:31.797145 126182 exec.cpp:236] Executor registered on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 16:28:31.798338 126183 docker.cpp:815] Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 4294967296 -e CLUSTER_ROLE=coordinator -e CLUSTER_ID=Coordinator002 -e ADDITIONAL_ARGS= -e AGENCY_ENDPOINTS=tcp://172.16.100.97:1025 tcp://172.16.100.99:1025 tcp://172.16.100.98:1025 -e HOST=172.16.100.99 -e PORT0=1027 -e LIBPROCESS_IP=172.16.100.99 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.61fb92b9-62e0-48b2-b2a3-3dc0b95f7818 -v /var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/0c81c1f2-0404-4943-ab81-0abc78763140/runs/61fb92b9-62e0-48b2-b2a3-3dc0b95f7818:/mnt/mesos/sandbox --net bridge -p 1027:8529/tcp --name mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.61fb92b9-62e0-48b2-b2a3-3dc0b95f7818 arangodb/arangodb-mesos:3.1

Mesos日志表明任务失败:

I0109 16:55:44.821689 13431 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2@172.16.100.99:2273
I0109 16:55:45.313108 13431 master.cpp:5466] Performing explicit task state reconciliation for 1 tasks of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0003 (marathon-user) at scheduler-f4e239f5-3249-4b48-9bae-24c1e3d3152c@172.16.100.98:42099
I0109 16:55:45.560523 13428 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141655 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2@172.16.100.99:2273
I0109 16:55:45.676347 13431 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.106:42540 with User-Agent='python-requests/2.10.0'
I0109 16:55:45.823482 13425 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39@172.16.100.107:44838
I0109 16:55:45.823698 13425 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.105:34374 with User-Agent='python-requests/2.10.0'
I0109 16:55:45.824986 13425 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141656 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39@172.16.100.107:44838
I0109 16:55:45.826448 13425 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.107:41364 with User-Agent='python-requests/2.10.0'
I0109 16:55:46.694202 13425 master.cpp:5140] Status update TASK_FAILED (UUID: 2abcbe87-e1d6-4968-965d-33429573dfd9) for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 from agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2 at slave(1)@172.16.100.99:5051 (172.16.100.99)
I0109 16:55:46.694247 13425 master.cpp:5202] Forwarding status update TASK_FAILED (UUID: 2abcbe87-e1d6-4968-965d-33429573dfd9) for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049
I0109 16:55:46.694344 13425 master.cpp:6844] Updating the state of task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (latest state: TASK_FAILED, status update state: TASK_FAILED)
I0109 16:55:46.695953 13425 master.cpp:4265] Processing ACKNOWLEDGE call 2abcbe87-e1d6-4968-965d-33429573dfd9 for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2@172.16.100.99:2273 on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 16:55:46.695989 13425 master.cpp:6910] Removing task 32014e7f-7f5b-4fea-b757-cca0faa3deac with resources mem(*):4096; cpus(*):1; disk(*):1024; ports(*):[1027-1027] of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2 at slave(1)@172.16.100.99:5051 (172.16.100.99)
I0109 16:55:46.824192 13430 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2@172.16.100.99:2273
I0109 16:55:46.824347 13430 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39@172.16.100.107:44838
I0109 16:55:46.825814 13425 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141658 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39@172.16.100.107:44838
I0109 16:55:47.567651 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141657 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2@172.16.100.99:2273

我有一个相同的系统导入任务在单个arangodb实例上运行没有任何问题,所以我认为问题不在于我的java代码。但是我无法找到任何进一步的日志来指出问题所在(但那时我对Mesos来说还是新手)。

我的群集:

4个Slaves(126GB RAM,每个8个CPU) 3个主机(32 GB RAM,4个CPU)

有谁可以告诉我我做错了什么,或者我可以在哪里找到更多的日志信息?

更新:stdout只显示启动日志消息(查看时间戳,stderr(上面)/ stdout(下面)中的两个条目都是从任务启动时开始,现在是失败时):

--container="mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.209e5908-e884-4c92-92b2-a27f1761ad6e" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --launcher_dir="/opt/mesosphere/packages/mesos--253f5cb0a96e2e3574293ddfecf5c63358527377/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/1d0cf1e4-3267-4150-a465-19ecca21fa65/runs/209e5908-e884-4c92-92b2-a27f1761ad6e" --stop_timeout="20secs"
--container="mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.209e5908-e884-4c92-92b2-a27f1761ad6e" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --launcher_dir="/opt/mesosphere/packages/mesos--253f5cb0a96e2e3574293ddfecf5c63358527377/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/1d0cf1e4-3267-4150-a465-19ecca21fa65/runs/209e5908-e884-4c92-92b2-a27f1761ad6e" --stop_timeout="20secs"
Registered docker executor on 172.16.100.99
Starting task 1d0cf1e4-3267-4150-a465-19ecca21fa65
Initializing database...Hang on...
Database initialized...Starting System...
2017-01-10T08:04:30Z [1] INFO ArangoDB 3.1.7 [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.1t  3 May 2016
2017-01-10T08:04:30Z [1] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2017-01-10T08:04:30Z [1] WARNING {agencycomm} got an agency redirect from 'http+tcp://172.16.100.97:1025' to 'http+tcp://172.16.100.99:1025'
2017-01-10T08:04:31Z [1] WARNING {agencycomm} Retrying agency communication at 'http+tcp://172.16.100.99:1025', tries: 2
2017-01-10T08:04:31Z [1] INFO Waiting for DBservers to show up...
2017-01-10T08:04:31Z [1] INFO Found 3 DBservers.
2017-01-10T08:04:31Z [1] INFO file-descriptors (nofiles) hard limit is 1048576, soft limit is 1048576
2017-01-10T08:04:31Z [1] INFO JavaScript using startup '/usr/share/arangodb3/js', application '/var/lib/arangodb3-apps'
2017-01-10T08:04:32Z [1] INFO Cluster feature is turned on. Agency version: {"server":"arango","version":"3.1.7","license":"community"}, Agency endpoints: http+tcp://172.16.100.99:1025, http+tcp://172.16.100.97:1025, http+tcp://172.16.100.98:1025, server id: 'Coordinator002', internal address: tcp://172.16.100.99:1027, role: COORDINATOR
2017-01-10T08:04:32Z [1] INFO using heartbeat interval value '1000 ms' from agency
2017-01-10T08:04:32Z [1] INFO In database '_system': Database is up-to-date (30107/cluster-local/existing)
2017-01-10T08:04:32Z [1] INFO using endpoint 'http+tcp://0.0.0.0:8529' for non-encrypted requests
2017-01-10T08:04:33Z [1] INFO bootstraped coordinator Coordinator002
2017-01-10T08:04:33Z [1] INFO ArangoDB (version 3.1.7 [linux]) is ready for business. Have fun!

我已经在stderr中注意到arangodb3任务的一些输出,但我不确定这只是记录请求还是部分问题 - 这会每隔几秒重复一次:

I0110 08:50:25.981262    23 HttpServer.cpp:456] handling http request 'GET /v1/endpoints.json'
I0110 08:50:26.000558    22 CaretakerCluster.cpp:470] And here the offer:
{"id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173464"},"framework_id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049"},"slave_id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S0"},"hostname":"172.16.100.97","url":{"scheme":"http","address":{"hostname":"172.16.100.97","ip":"172.16.100.97","port":5051},"path":"/slave(1)","query":[]},"resources":[{"name":"ports","type":1,"ranges":{"range":[{"begin":1028,"end":2180},{"begin":2182,"end":3887},{"begin":3889,"end":5049},{"begin":5052,"end":8079},{"begin":8082,"end":8180},{"begin":8182,"end":8528},{"begin":8530,"end":32000}]},"role":"*"},{"name":"disk","type":0,"scalar":{"value":291730},"role":"*"},{"name":"cpus","type":0,"scalar":{"value":4.75},"role":"*"},{"name":"mem","type":0,"scalar":{"value":117367},"role":"*"}],"attributes":[],"executor_ids":[]}

更新2:同样,/ mesos中的在线登录显示了这一点 - 这是否意味着群集未正常启动?

I0110 09:55:30.857897 13427 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.107:60624 with User-Agent='python-requests/2.10.0'
I0110 09:55:31.111609 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173624 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-b97d05bc-d50c-49e6-8ef1-a9ff324fd2ec@172.16.100.98:9546
I0110 09:55:31.111747 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173623 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-b97d05bc-d50c-49e6-8ef1-a9ff324fd2ec@172.16.100.98:9546
I

0 个答案:

没有答案