Mesos Marathon app with a persistent volume stalls

Time: 2016-05-24 11:55:18

Tags: mesos marathon

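I'm having a problem running an app in Marathon with persistent local volumes. Following the instructions, I started Marathon with a role and a principal and created a simple app with a persistent volume, but the app just hangs in the suspended state. The slave appears to have responded with a valid offer, yet the app never actually starts. The slave logs nothing about the task, even when I compile with the debug option and turn up logging with GLOG_v=2.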

Marathon also seems to keep rolling the task ID as it tries to start the task, but I can't see the reason anywhere.

Strangely, when I run without a persistent volume but with a disk reservation, the app starts running.

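The debug logs on Marathon don't seem to show anything useful, though I may be missing something. Can anyone give me pointers on what the problem might be, or where to look for additional debugging information? Many thanks in advance.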

Here is some information about my environment, along with debugging output:

Slave: Ubuntu 14.04 running the prebuilt Mesos 0.28; also tested with 0.29 built from source

Master: Mesos 0.28 running in a Docker Ubuntu 14.04 image on CoreOS

Marathon: 1.1.1 running in a Docker Ubuntu 14.04 image on CoreOS

App with persistent storage

App info from Marathon at v2/apps/test/tasks

{
  "app": {
    "id": "/test",
    "cmd": "while true; do sleep 10; done",
    "args": null,
    "user": null,
    "env": {},
    "instances": 1,
    "cpus": 1,
    "mem": 128,
    "disk": 0,
    "executor": "",
    "constraints": [
      [
        "role",
        "CLUSTER",
        "persistent"
      ]
    ],
    "uris": [],
    "fetch": [],
    "storeUrls": [],
    "ports": [
      10002
    ],
    "portDefinitions": [
      {
        "port": 10002,
        "protocol": "tcp",
        "labels": {}
      }
    ],
    "requirePorts": false,
    "backoffSeconds": 1,
    "backoffFactor": 1.15,
    "maxLaunchDelaySeconds": 3600,
    "container": {
      "type": "MESOS",
      "volumes": [
        {
          "containerPath": "test",
          "mode": "RW",
          "persistent": {
            "size": 100
          }
        }
      ]
    },
    "healthChecks": [],
    "readinessChecks": [],
    "dependencies": [],
    "upgradeStrategy": {
      "minimumHealthCapacity": 0.5,
      "maximumOverCapacity": 0
    },
    "labels": {},
    "acceptedResourceRoles": null,
    "ipAddress": null,
    "version": "2016-05-19T11:31:54.861Z",
    "residency": {
      "relaunchEscalationTimeoutSeconds": 3600,
      "taskLostBehavior": "WAIT_FOREVER"
    },
    "versionInfo": {
      "lastScalingAt": "2016-05-19T11:31:54.861Z",
      "lastConfigChangeAt": "2016-05-18T16:46:59.684Z"
    },
    "tasksStaged": 0,
    "tasksRunning": 0,
    "tasksHealthy": 0,
    "tasksUnhealthy": 0,
    "deployments": [
      {
        "id": "4f3779e5-a805-4b95-9065-f3cf9c90c8fe"
      }
    ],
    "tasks": [
      {
        "id": "test.4b7d4303-1dc2-11e6-a179-a2bd870b1e9c",
        "slaveId": "9f7c6ed5-4bf5-475d-9311-05d21628604e-S17",
        "host": "ip-10-0-90-61.eu-west-1.compute.internal",
        "localVolumes": [
          {
            "containerPath": "test",
            "persistenceId": "test#test#4b7d4302-1dc2-11e6-a179-a2bd870b1e9c"
          }
        ],
        "appId": "/test"
      }
    ]
  }
}
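The symptom in this response can be checked mechanically. The following is a minimal sketch, not part of Marathon: the function name `summarize_app` is hypothetical, and treating a task that has `localVolumes` but neither `stagedAt` nor `startedAt` as "reserved but never launched" is my reading of the output above rather than documented behaviour.

```python
def summarize_app(response):
    """Summarize launch state from a parsed Marathon app response (a dict)."""
    app = response["app"]
    # Assumption: a task that lists localVolumes but carries neither stagedAt
    # nor startedAt holds a reservation and was never actually launched.
    unlaunched = [
        task["id"]
        for task in app.get("tasks", [])
        if task.get("localVolumes")
        and "stagedAt" not in task
        and "startedAt" not in task
    ]
    return {
        "id": app["id"],
        "deploying": bool(app.get("deployments")),
        "running": app.get("tasksRunning", 0),
        "staged": app.get("tasksStaged", 0),
        "unlaunched": unlaunched,
    }


# Trimmed copy of the /test response shown above.
sample = {
    "app": {
        "id": "/test",
        "tasksStaged": 0,
        "tasksRunning": 0,
        "deployments": [{"id": "4f3779e5-a805-4b95-9065-f3cf9c90c8fe"}],
        "tasks": [
            {
                "id": "test.4b7d4303-1dc2-11e6-a179-a2bd870b1e9c",
                "localVolumes": [
                    {
                        "containerPath": "test",
                        "persistenceId": "test#test#4b7d4302-1dc2-11e6-a179-a2bd870b1e9c",
                    }
                ],
            }
        ],
    }
}

print(summarize_app(sample))
```

For the response above this reports an active deployment, zero staged or running tasks, and one task ID that has a volume assigned but no launch timestamps.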

App info in the Marathon UI (the deployment just seems to spin):

Stuck at waiting instance info (screenshot)

App without persistent storage

App info from Marathon at v2/apps/test2/tasks

{
  "app": {
    "id": "/test2",
    "cmd": "while true; do sleep 10; done",
    "args": null,
    "user": null,
    "env": {},
    "instances": 1,
    "cpus": 1,
    "mem": 128,
    "disk": 100,
    "executor": "",
    "constraints": [
      [
        "role",
        "CLUSTER",
        "persistent"
      ]
    ],
    "uris": [],
    "fetch": [],
    "storeUrls": [],
    "ports": [
      10002
    ],
    "portDefinitions": [
      {
        "port": 10002,
        "protocol": "tcp",
        "labels": {}
      }
    ],
    "requirePorts": false,
    "backoffSeconds": 1,
    "backoffFactor": 1.15,
    "maxLaunchDelaySeconds": 3600,
    "container": null,
    "healthChecks": [],
    "readinessChecks": [],
    "dependencies": [],
    "upgradeStrategy": {
      "minimumHealthCapacity": 0.5,
      "maximumOverCapacity": 0
    },
    "labels": {},
    "acceptedResourceRoles": null,
    "ipAddress": null,
    "version": "2016-05-19T13:44:01.831Z",
    "residency": null,
    "versionInfo": {
      "lastScalingAt": "2016-05-19T13:44:01.831Z",
      "lastConfigChangeAt": "2016-05-19T13:09:20.106Z"
    },
    "tasksStaged": 0,
    "tasksRunning": 1,
    "tasksHealthy": 0,
    "tasksUnhealthy": 0,
    "deployments": [],
    "tasks": [
      {
        "id": "test2.bee624f1-1dc7-11e6-b98e-568f3f9dead8",
        "slaveId": "9f7c6ed5-4bf5-475d-9311-05d21628604e-S18",
        "host": "ip-10-0-90-61.eu-west-1.compute.internal",
        "startedAt": "2016-05-19T13:44:02.190Z",
        "stagedAt": "2016-05-19T13:44:02.023Z",
        "ports": [
          31926
        ],
        "version": "2016-05-19T13:44:01.831Z",
        "ipAddresses": [
          {
            "ipAddress": "10.0.90.61",
            "protocol": "IPv4"
          }
        ],
        "appId": "/test2"
      }
    ],
    "lastTaskFailure": {
      "appId": "/test2",
      "host": "ip-10-0-90-61.eu-west-1.compute.internal",
      "message": "Slave ip-10-0-90-61.eu-west-1.compute.internal removed: health check timed out",
      "state": "TASK_LOST",
      "taskId": "test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c",
      "timestamp": "2016-05-19T13:15:24.155Z",
      "version": "2016-05-19T13:09:20.106Z",
      "slaveId": "9f7c6ed5-4bf5-475d-9311-05d21628604e-S17"
    }
  }
}
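To narrow down what differs between the app that hangs and the one that runs, the two definitions can be compared field by field. This is a rough sketch: `diff_fields` is a hypothetical helper, the field list is my guess at the relevant settings, and the two dictionaries are trimmed copies of the responses above.

```python
def diff_fields(first, second, fields=("disk", "container", "residency")):
    """Return {field: (first_value, second_value)} for the fields that differ."""
    return {
        name: (first.get(name), second.get(name))
        for name in fields
        if first.get(name) != second.get(name)
    }


# Trimmed copies of the two app definitions shown above.
test_app = {
    "id": "/test",
    "disk": 0,
    "container": {
        "type": "MESOS",
        "volumes": [
            {"containerPath": "test", "mode": "RW", "persistent": {"size": 100}}
        ],
    },
    "residency": {
        "relaunchEscalationTimeoutSeconds": 3600,
        "taskLostBehavior": "WAIT_FOREVER",
    },
}
test2_app = {"id": "/test2", "disk": 100, "container": None, "residency": None}

for name, (with_volume, without_volume) in diff_fields(test_app, test2_app).items():
    print(name, "|", with_volume, "|", without_volume)
```

All three fields differ: /test asks for no plain disk and gets its 100 from the persistent volume plus a residency block, while /test2 asks for `disk: 100` directly and has neither.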

Slave log when running the app:

I0519 13:09:22.471876 12459 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.471906 12459 status_update_manager.cpp:497] Creating StatusUpdate stream for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.472262 12459 status_update_manager.cpp:824] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.477686 12459 status_update_manager.cpp:374] Forwarding update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000 to the agent
I0519 13:09:22.477830 12453 process.cpp:2605] Resuming slave(1)@10.0.90.61:5051 at 2016-05-19 13:09:22.477814016+00:00
I0519 13:09:22.477967 12453 slave.cpp:3638] Forwarding the update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000 to master@10.0.82.230:5050
I0519 13:09:22.478185 12453 slave.cpp:3532] Status update manager successfully handled status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.478229 12453 slave.cpp:3548] Sending acknowledgement for status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000 to executor(1)@10.0.90.61:34262
I0519 13:09:22.488315 12460 pid.cpp:95] Attempting to parse 'master@10.0.82.230:5050' into a PID
I0519 13:09:22.488370 12460 process.cpp:646] Parsed message name 'mesos.internal.StatusUpdateAcknowledgementMessage' for slave(1)@10.0.90.61:5051 from master@10.0.82.230:5050
I0519 13:09:22.488452 12452 process.cpp:2605] Resuming slave(1)@10.0.90.61:5051 at 2016-05-19 13:09:22.488441856+00:00
I0519 13:09:22.488600 12458 process.cpp:2605] Resuming (14)@10.0.90.61:5051 at 2016-05-19 13:09:22.488590080+00:00
I0519 13:09:22.488632 12458 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.488726 12458 status_update_manager.cpp:824] Checkpointing ACK for status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
I0519 13:09:22.492985 12452 process.cpp:2605] Resuming slave(1)@10.0.90.61:5051 at 2016-05-19 13:09:22.492974080+00:00
I0519 13:09:22.493021 12452 slave.cpp:2629] Status update manager successfully handled status update acknowledgement (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
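To confirm that the slave really logs nothing for a given task, the log can be filtered by task ID. This is a minimal sketch that assumes the glog line layout seen above (level letter, MMDD date, time, thread ID, `file:line]`, message). The regular expression and the function name `lines_for_task` are my own and not part of Mesos.

```python
import re

# Assumed layout: "I0519 13:09:22.471876 12459 file.cpp:320] message"
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\s:]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def lines_for_task(log_lines, task_id):
    """Return parsed fields of every log line whose message mentions task_id."""
    matches = []
    for raw in log_lines:
        parsed = GLOG_LINE.match(raw.strip())
        if parsed and task_id in parsed.group("message"):
            matches.append(parsed.groupdict())
    return matches


# Two lines copied from the slave log above.
sample_log = [
    "I0519 13:09:22.471876 12459 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000",
    "I0519 13:09:22.477830 12453 process.cpp:2605] Resuming slave(1)@10.0.90.61:5051 at 2016-05-19 13:09:22.477814016+00:00",
]

running = lines_for_task(sample_log, "test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c")
stuck = lines_for_task(sample_log, "test.4b7d4303-1dc2-11e6-a179-a2bd870b1e9c")
print(len(running), len(stuck))
```

Run against the full slave log, an empty result for the persistent-volume task ID would back up the observation that the launch request never reaches the slave.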

1 answer:

Answer 0 (score: 0)

This may be caused by insufficient disk space or RAM. The minimum free resources required are specified at the following link.