当已部署的容器完成时,如何使GCE实例停止?

时间:2018-10-10 20:29:56

标签: containers google-compute-engine

我有一个执行单个大型计算的Docker容器。此计算需要大量内存,并且需要大约12个小时才能运行。

我可以创建适当大小的Google Compute Engine VM,并使用“将容器映像部署到此VM实例”选项来完美运行此作业。但是,一旦作业完成,容器就会退出,但VM仍在运行(并且正在充电)。

如何在容器退出时使VM退出/停止/删除?

当VM处于其僵尸模式时,只有堆栈驱动程序容器处于运行状态:

$ docker ps
CONTAINER ID        IMAGE                                                                COMMAND                  CREATED             STATUS              PORTS               NAMES
bfa2feb03180        gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1   "/entrypoint.sh /u..."   17 hours ago        Up 17 hours                             stackdriver-logging-agent
161439a487c2        gcr.io/stackdriver-agents/stackdriver-metadata-agent:0.2-0.0.17-2    "/bin/sh -c /opt/s..."   17 hours ago        Up 17 hours         8000/tcp            stackdriver-metadata-agent

我这样创建VM:

gcloud beta compute --project=abc instances create-with-container vm-name \
                    --zone=us-central1-c --machine-type=custom-1-65536-ext \
                    --network=default --network-tier=PREMIUM --metadata=google-logging-enabled=true \
                    --maintenance-policy=MIGRATE \
                    --service-account=xyz \
                    --scopes=https://www.googleapis.com/auth/cloud-platform \
                    --image=cos-stable-69-10895-71-0 --image-project=cos-cloud --boot-disk-size=10GB \
                    --boot-disk-type=pd-standard --boot-disk-device-name=vm-name \
                    --container-image=gcr.io/abc/my-image --container-restart-policy=on-failure \
                    --container-command=python3 \
                    --container-arg="a" --container-arg="b" --container-arg="c" \
                    --labels=container-vm=cos-stable-69-10895-71-0

3 个答案:

答案 0 :(得分:5)

创建VM时,需要授予其对计算的写入权限,以便您可以从内部删除实例。您还应该在此时设置容器环境变量,例如gce_zonegce_project_id。您将需要他们删除实例。

gcloud beta compute instances create-with-container {NAME} \
    --container-env=gce_zone={ZONE},gce_project_id={PROJECT_ID} \
    --service-account={SERVICE_ACCOUNT} \
    --scopes=https://www.googleapis.com/auth/compute,...
    ...

然后在容器中,每当您确定任务完成时:

  1. 请求一个api令牌(为简化起见,使用curl并使用DEFAULT gce服务帐户)
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google"

这将以类似

的json进行响应
{
  "access_token": "foobarbaz...",
  "expires_in": 1234,
  "token_type": "Bearer"
}
  1. 获取该访问令牌并点击instances.delete api endpoint(注意环境变量)
curl -XDELETE -H 'Authorization: Bearer {TOKEN}' https://www.googleapis.com/compute/v1/projects/$gce_project_id/zones/$gce_zone/instances/$HOSTNAME

答案 1 :(得分:5)

解决了一段时间的问题,这是一个很好的完整解决方案。

此解决方案不使用“带有容器映像的启动计算机”选项。相反,它使用启动脚本,该脚本更加灵活。您仍然使用容器优化的OS实例实例。

  1. 创建启动脚本:
#!/usr/bin/env bash

# get image name and container parameters from the metadata
IMAGE_NAME=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/image_name -H "Metadata-Flavor: Google")

CONTAINER_PARAM=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/container_param -H "Metadata-Flavor: Google")

# This is needed if you are using a private images in GCP Container Registry
# (possibly also for the gcp log driver?)
sudo HOME=/home/root /usr/bin/docker-credential-gcr configure-docker

# Run! The logs will go to stack driver 
sudo HOME=/home/root  docker run --log-driver=gcplogs ${IMAGE_NAME} ${CONTAINER_PARAM}

# Get the zone
zoneMetadata=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# Split on / and get the 4th element to get the actual zone name
IFS=$'/'
zoneMetadataSplit=($zoneMetadata)
ZONE="${zoneMetadataSplit[3]}"

# Run compute delete on the current instance. Need to run in a container 
# because COS machines don't come with gcloud installed 
docker run --entrypoint "gcloud" google/cloud-sdk:alpine compute instances delete ${HOSTNAME}  --delete-disks=all --zone=${ZONE}
  1. 将脚本放置在公共位置。例如,将其放在Cloud Storage上并创建一个公共URL。您不能将gs:// URI用于COS启动脚本。

  2. 使用startup-script-url启动实例,并传递图像名称和参数,例如:

gcloud compute --project=PROJECT_NAME instances create INSTANCE_NAME  \
--zone=ZONE --machine-type=TYPE \
--metadata=image_name=IMAGE_NAME,\
container_param="PARAM1 PARAM2 PARAM3",\
startup-script-url=PUBLIC_SCRIPT_URL \
--maintenance-policy=MIGRATE --service-account=SERVICE_ACCUNT \
--scopes=https://www.googleapis.com/auth/cloud-platform --image-family=cos-stable \
--image-project=cos-cloud --boot-disk-size=10GB --boot-disk-device-name=DISK_NAME

(您可能想限制scopes,为简单起见,该示例使用完全访问权限)

答案 2 :(得分:3)

我根据文森特的答案编写了一个独立的Python函数。

def kill_vm():
    """
    If we are running inside a GCE VM, kill it.
    """
    # based on https://stackoverflow.com/q/52748332/321772
    import json
    import logging
    import requests

    # get the token
    r = json.loads(
        requests.get("http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
                     headers={"Metadata-Flavor": "Google"})
            .text)

    token = r["access_token"]

    # get instance metadata
    # based on https://cloud.google.com/compute/docs/storing-retrieving-metadata
    project_id = requests.get("http://metadata.google.internal/computeMetadata/v1/project/project-id",
                              headers={"Metadata-Flavor": "Google"}).text

    name = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/name",
                        headers={"Metadata-Flavor": "Google"}).text

    zone_long = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/zone",
                             headers={"Metadata-Flavor": "Google"}).text
    zone = zone_long.split("/")[-1]

    # shut ourselves down
    logging.info("Calling API to delete this VM, {zone}/{name}".format(zone=zone, name=name))

    requests.delete("https://www.googleapis.com/compute/v1/projects/{project_id}/zones/{zone}/instances/{name}"
                    .format(project_id=project_id, zone=zone, name=name),
                    headers={"Authorization": "Bearer {token}".format(token=token)})

一个简单的atexit钩子让我得到了我想要的行为:

import atexit
atexit.register(kill_vm)