Question

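I am running a TensorFlow model on an AKS cluster with GPU nodes. The model currently runs in a single TF Serving container (https://hub.docker.com/r/tensorflow/serving) in a single pod on a single GPU node.

By default, the TF Serving container claims all the RAM available to the pod, but I can reduce the container's memory request in my deployment.yaml and still get the same results within an acceptable processing time. I would like to know whether it is possible to run two TF models in parallel on the same GPU node. Memory-wise it should work, but when I scale my deployment's replica count to two, it tries to deploy two pods and the second one stays stuck in Pending.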

$ kubectl get po -n myproject -w
NAME                                 READY   STATUS    RESTARTS   AGE
myproject-deployment-cb7769df4-ljcfc   1/1     Running   0          2m
myproject-deployment-cb7769df4-np9qd   0/1     Pending   0          26s

If I describe the pod, I get the following error:

$ kubectl describe po -n myproject myproject-deployment-cb7769df4-np9qd
Name:           myproject-deployment-cb7769df4-np9qd
Namespace:      myproject
<...>
Events:
  Type     Reason            Age   From                Message
  ----     ------            ----  ----                -------
  Warning  FailedScheduling  105s  default-scheduler   0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

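Since the first pod "claims" the GPU, the second pod can no longer use it and stays Pending. I see two different possibilities:

1. Run two TF Serving containers in one pod on one GPU node
2. Run two pods, each with one TF Serving container, on one GPU node

Is either of these options feasible?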

My deployment can be found below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myproject-deployment
  labels:
    app: myproject-server
  namespace: myproject
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myproject-server
  template:
    metadata:
      labels:
        app: myproject-server
    spec:
      containers:
      - name: server
        image: tensorflow/serving:2.3.0-gpu
        ports:
        - containerPort: 8500
        volumeMounts:
          - name: azurestorage
            mountPath: /models
        resources:
          requests:
            memory: "10Gi"
            cpu: "1"
          limits:
            memory: "12Gi"
            cpu: "2"
            nvidia.com/gpu: 1
        args: ["--model_config_file=/models/models.config", "--monitoring_config_file=/models/monitoring.config"]
      volumes:
      - name: azurestorage
        persistentVolumeClaim:
          claimName: pvcmodels

Answer 1

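Interesting question. As far as I know, this is not possible, and it is also not possible for two containers running in the same pod (resources are configured at the container level), at least not out of the box (see https://github.com/kubernetes/kubernetes/issues/52757).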
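To make the reasoning concrete, here is a toy sketch of the arithmetic behind the "Insufficient nvidia.com/gpu" event. It is not the real kube-scheduler code; the function names are made up for illustration. It only assumes what Kubernetes documents for GPUs: they are requested as whole numbers per container, and a GPU handed to one container is not shared with another.

```python
# Toy model of the fit check for an extended resource such as nvidia.com/gpu.
# NOT the real kube-scheduler; function names are hypothetical.

def pod_gpu_request(container_gpu_limits):
    """A pod needs the sum of the whole-number GPU limits of its containers."""
    return sum(container_gpu_limits)


def schedule(node_gpus, pods):
    """Place pods in order on one node and return each pod's resulting status."""
    free = node_gpus
    statuses = []
    for containers in pods:
        need = pod_gpu_request(containers)
        if need <= free:
            free -= need
            statuses.append("Running")
        else:
            # This is the case reported as "Insufficient nvidia.com/gpu".
            statuses.append("Pending")
    return statuses


# One node with one GPU, two replicas whose single container asks for 1 GPU.
print(schedule(1, [[1], [1]]))  # ['Running', 'Pending']

# One pod with two containers that each ask for 1 GPU: needs 2, node has 1.
print(schedule(1, [[1, 1]]))    # ['Pending']
```

Both options from the question hit the same limit: each container that should see the GPU has to request at least one whole GPU, and the node only advertises one.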

While looking for an answer I found this: https://blog.ml6.eu/a-guide-to-gpu-sharing-on-top-of-kubernetes-6097935ababf, but it involves modifying Kubernetes itself.

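You could run multiple processes in the same container to achieve sharing, but that goes somewhat against the idea of Kubernetes/containers, and it certainly won't work for 2 entirely different workloads/services.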
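A minimal sketch of the multiple-processes idea, assuming you override the image's default entrypoint with a small launcher like this one. The model names, paths, and the 0.45 memory fraction are hypothetical placeholders. The flags used (--port, --model_name, --model_base_path, --per_process_gpu_memory_fraction) are tensorflow_model_server flags; I have not tested this setup on AKS.

```python
import subprocess


def build_server_commands(models, gpu_fraction, first_grpc_port=8500):
    """Return one tensorflow_model_server argv per model, each on its own gRPC port."""
    commands = []
    for i, (name, base_path) in enumerate(models):
        commands.append([
            "tensorflow_model_server",
            f"--port={first_grpc_port + i}",
            f"--model_name={name}",
            f"--model_base_path={base_path}",
            # Caps the share of GPU memory this process may allocate.
            f"--per_process_gpu_memory_fraction={gpu_fraction}",
        ])
    return commands


def launch(commands):
    """Start every server process and block until all of them exit."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    for proc in procs:
        proc.wait()


# Hypothetical model names and paths; replace with your own.
MODELS = [("model_a", "/models/model_a"), ("model_b", "/models/model_b")]

for cmd in build_server_commands(MODELS, gpu_fraction=0.45):
    print(" ".join(cmd))
```

Calling launch(build_server_commands(MODELS, 0.45)) inside the container would start both servers. The container still requests a single nvidia.com/gpu, so Kubernetes sees one GPU consumer. The memory fraction only limits GPU memory per process; it does not isolate compute between the two servers.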
