Why isn't my Mesos framework being offered any resources?

Time: 2017-10-03 13:04:52

Tags: mesos

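I am using Mesos 1.0.1. I added an agent with a new role, docker_gpu_worker, and registered a framework with that role. The framework does not receive any offers. Other frameworks that use other roles (same Java code) work fine. I have not restarted the three Mesos masters. Does anyone know what might be wrong?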

In master/frameworks, I see my framework:

"{
  "id": "fd01b1b0-eb73-4d40-8774-009171ae1db1-0701",
  "name": "/data4/Users/mikeb/jobs/999",
  "pid": "scheduler-77345362-b85c-4044-8db5-0106b9015119@x.x.x.x:57617",
  "used_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "offered_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "capabilities": [],
  "hostname": "x-x-x-x.ec2.internal",
  "webui_url": "",
  "active": true,
  "user": "mikeb",
  "failover_timeout": 10080,
  "checkpoint": true,
  "role": "docker_gpu_worker",
  "registered_time": 1507028279.18887,
  "unregistered_time": 0,
  "principal": "test-framework-java",
  "resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "tasks": [],
  "completed_tasks": [],
  "offers": [],
  "executors": []
}"

In master/roles, I see my role:

"{
  "frameworks": [
    "fd01b1b0-eb73-4d40-8774-009171ae1db1-0701",
    "fd01b1b0-eb73-4d40-8774-009171ae1db1-0673",
    "fd01b1b0-eb73-4d40-8774-009171ae1db1-0335"
  ],
  "name": "docker_gpu_worker",
  "resources": {
    "cpus": 0,
    "disk": 0,
    "gpus": 0,
    "mem": 0
  },
  "weight": 1
}"

In master/slaves, I see my agent:

"{
  "id": "fd01b1b0-eb73-4d40-8774-009171ae1db1-S5454",
  "pid": "slave(1)@x.x.x.x:5051",
  "hostname": "x-x-x-x.ec2.internal",
  "registered_time": 1506692213.24938,
  "resources": {
    "disk": 35056,
    "mem": 59363,
    "gpus": 4,
    "cpus": 32,
    "ports": "[31000-32000]"
  },
  "used_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "offered_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "reserved_resources": {
    "docker_gpu_worker": {
      "disk": 35056,
      "mem": 59363,
      "gpus": 4,
      "cpus": 32,
      "ports": "[31000-32000]"
    }
  },
  "unreserved_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "attributes": {},
  "active": true,
  "version": "1.0.1",
  "reserved_resources_full": {
    "docker_gpu_worker": [
      {
        "name": "gpus",
        "type": "SCALAR",
        "scalar": {
          "value": 4
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "cpus",
        "type": "SCALAR",
        "scalar": {
          "value": 32
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "mem",
        "type": "SCALAR",
        "scalar": {
          "value": 59363
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "disk",
        "type": "SCALAR",
        "scalar": {
          "value": 35056
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "ports",
        "type": "RANGES",
        "ranges": {
          "range": [
            {
              "begin": 31000,
              "end": 32000
            }
          ]
        },
        "role": "docker_gpu_worker"
      }
    ]
  },
  "used_resources_full": [],
  "offered_resources_full": []
}"

We have traced the problem to this Mesos agent configuration:

--isolation="filesystem/linux,cgroups/devices,gpu/nvidia"

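With this flag removed, the agent works normally, but the GPU resources are not accessible. According to the Nvidia GPU support docs, this configuration is required, and the docs seem to indicate that version 1.0.1 supports it. We are continuing to investigate.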

2 answers:

Answer 0: (score: 0)

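The GPU_RESOURCES capability must be enabled for the framework. Note that the master/frameworks output in the question shows "capabilities": [] for this framework.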

As shown at http://mesos.readthedocs.io/en/latest/gpu-support/, this can be done by passing --framework_capabilities="GPU_RESOURCES" to the mesos-execute command, or in C++ with code like this:

FrameworkInfo framework;
framework.add_capabilities()->set_type(
    FrameworkInfo::Capability::GPU_RESOURCES);

For the Marathon framework, the Marathon service must be started with the --enable_features "gpu_resources" option, as described in "Enable GPU resources (CUDA) on DC/OS".
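
To check this against the endpoint output in the question, here is a minimal Python sketch. It assumes the master lists capabilities as plain strings such as "GPU_RESOURCES" in master/frameworks, and that GPU-bearing agents are only offered to frameworks that opt in, as the GPU support docs describe. The function name missing_offer_reasons and the trimmed sample dictionaries are illustrative, not part of Mesos.

```python
def missing_offer_reasons(framework, agent):
    """Return likely reasons why `agent` is not offered to `framework`.

    `framework` and `agent` are dicts shaped like the entries shown by
    master/frameworks and master/slaves (assumed layout, see lead-in).
    """
    reasons = []

    # Role check: the agent must have resources reserved for the
    # framework's role, or some unreserved resources left over.
    role = framework.get("role", "*")
    reserved = agent.get("reserved_resources", {})
    unreserved = agent.get("unreserved_resources", {})
    has_unreserved = any(value for value in unreserved.values())
    if role not in reserved and not has_unreserved:
        reasons.append("no resources available for role " + role)

    # GPU check: a framework without the GPU_RESOURCES capability
    # is not sent offers from an agent that advertises GPUs.
    agent_gpus = agent.get("resources", {}).get("gpus", 0)
    if agent_gpus > 0 and "GPU_RESOURCES" not in framework.get("capabilities", []):
        reasons.append("framework lacks the GPU_RESOURCES capability")

    return reasons


# Values trimmed from the output in the question.
framework = {"role": "docker_gpu_worker", "capabilities": []}
agent = {
    "resources": {"gpus": 4, "cpus": 32},
    "reserved_resources": {"docker_gpu_worker": {"gpus": 4, "cpus": 32}},
    "unreserved_resources": {"gpus": 0, "cpus": 0},
}

print(missing_offer_reasons(framework, agent))
```

With the values from the question, the role check passes and only the capability check fails, which matches the symptom: the role and the reservation look correct, yet no offers arrive.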

Answer 1: (score: -1)

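Roles can be registered statically on the master. If you add an agent role at runtime, the master does not know about it, and the Mesos master has to be restarted before it sees the role. Try restarting the Mesos masters.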