我有一个执行单个大型计算的Docker容器。此计算需要大量内存,并且需要大约12个小时才能运行。
我可以创建适当大小的Google Compute Engine VM,并使用“将容器映像部署到此VM实例”选项来完美运行此作业。但是,一旦作业完成,容器就会退出,但VM仍在运行(并且正在充电)。
如何在容器退出时使VM退出/停止/删除?
当VM处于其僵尸模式时,只有堆栈驱动程序容器处于运行状态:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bfa2feb03180 gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1 "/entrypoint.sh /u..." 17 hours ago Up 17 hours stackdriver-logging-agent
161439a487c2 gcr.io/stackdriver-agents/stackdriver-metadata-agent:0.2-0.0.17-2 "/bin/sh -c /opt/s..." 17 hours ago Up 17 hours 8000/tcp stackdriver-metadata-agent
我这样创建VM:
gcloud beta compute --project=abc instances create-with-container vm-name \
--zone=us-central1-c --machine-type=custom-1-65536-ext \
--network=default --network-tier=PREMIUM --metadata=google-logging-enabled=true \
--maintenance-policy=MIGRATE \
--service-account=xyz \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--image=cos-stable-69-10895-71-0 --image-project=cos-cloud --boot-disk-size=10GB \
--boot-disk-type=pd-standard --boot-disk-device-name=vm-name \
--container-image=gcr.io/abc/my-image --container-restart-policy=on-failure \
--container-command=python3 \
--container-arg="a" --container-arg="b" --container-arg="c" \
--labels=container-vm=cos-stable-69-10895-71-0
答案 0 :(得分:5)
创建VM时,需要授予其对计算的写入权限,以便您可以从内部删除实例。您还应该在此时设置容器环境变量,例如gce_zone
和gce_project_id
。您将需要他们删除实例。
gcloud beta compute instances create-with-container {NAME} \
--container-env=gce_zone={ZONE},gce_project_id={PROJECT_ID} \
--service-account={SERVICE_ACCOUNT} \
--scopes=https://www.googleapis.com/auth/compute,...
...
然后在容器中,每当您确定任务完成时:
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google"
这将以类似
的json进行响应{
"access_token": "foobarbaz...",
"expires_in": 1234,
"token_type": "Bearer"
}
instances.delete
api endpoint(注意环境变量)curl -XDELETE -H 'Authorization: Bearer {TOKEN}' https://www.googleapis.com/compute/v1/projects/$gce_project_id/zones/$gce_zone/instances/$HOSTNAME
答案 1 :(得分:5)
解决了一段时间的问题,这是一个很好的完整解决方案。
此解决方案不使用“带有容器映像的启动计算机”选项。相反,它使用启动脚本,该脚本更加灵活。您仍然使用容器优化的OS实例实例。
#!/usr/bin/env bash
# get image name and container parameters from the metadata
IMAGE_NAME=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/image_name -H "Metadata-Flavor: Google")
CONTAINER_PARAM=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/container_param -H "Metadata-Flavor: Google")
# This is needed if you are using a private images in GCP Container Registry
# (possibly also for the gcp log driver?)
sudo HOME=/home/root /usr/bin/docker-credential-gcr configure-docker
# Run! The logs will go to stack driver
sudo HOME=/home/root docker run --log-driver=gcplogs ${IMAGE_NAME} ${CONTAINER_PARAM}
# Get the zone
zoneMetadata=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# Split on / and get the 4th element to get the actual zone name
IFS=$'/'
zoneMetadataSplit=($zoneMetadata)
ZONE="${zoneMetadataSplit[3]}"
# Run compute delete on the current instance. Need to run in a container
# because COS machines don't come with gcloud installed
docker run --entrypoint "gcloud" google/cloud-sdk:alpine compute instances delete ${HOSTNAME} --delete-disks=all --zone=${ZONE}
将脚本放置在公共位置。例如,将其放在Cloud Storage上并创建一个公共URL。您不能将gs://
URI用于COS启动脚本。
使用startup-script-url
启动实例,并传递图像名称和参数,例如:
gcloud compute --project=PROJECT_NAME instances create INSTANCE_NAME \
--zone=ZONE --machine-type=TYPE \
--metadata=image_name=IMAGE_NAME,\
container_param="PARAM1 PARAM2 PARAM3",\
startup-script-url=PUBLIC_SCRIPT_URL \
--maintenance-policy=MIGRATE --service-account=SERVICE_ACCUNT \
--scopes=https://www.googleapis.com/auth/cloud-platform --image-family=cos-stable \
--image-project=cos-cloud --boot-disk-size=10GB --boot-disk-device-name=DISK_NAME
(您可能想限制scopes
,为简单起见,该示例使用完全访问权限)
答案 2 :(得分:3)
我根据文森特的答案编写了一个独立的Python函数。
def kill_vm():
"""
If we are running inside a GCE VM, kill it.
"""
# based on https://stackoverflow.com/q/52748332/321772
import json
import logging
import requests
# get the token
r = json.loads(
requests.get("http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
headers={"Metadata-Flavor": "Google"})
.text)
token = r["access_token"]
# get instance metadata
# based on https://cloud.google.com/compute/docs/storing-retrieving-metadata
project_id = requests.get("http://metadata.google.internal/computeMetadata/v1/project/project-id",
headers={"Metadata-Flavor": "Google"}).text
name = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/name",
headers={"Metadata-Flavor": "Google"}).text
zone_long = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/zone",
headers={"Metadata-Flavor": "Google"}).text
zone = zone_long.split("/")[-1]
# shut ourselves down
logging.info("Calling API to delete this VM, {zone}/{name}".format(zone=zone, name=name))
requests.delete("https://www.googleapis.com/compute/v1/projects/{project_id}/zones/{zone}/instances/{name}"
.format(project_id=project_id, zone=zone, name=name),
headers={"Authorization": "Bearer {token}".format(token=token)})
一个简单的atexit
钩子让我得到了我想要的行为:
import atexit
atexit.register(kill_vm)