Question

我正在尝试使用acs-engine在Azure中设置群集，以使用VMSS作为代理池来构建Kubernetes群集。群集启动后，我添加了cluster-autoscaler以管理2个专用代理程序池，1个cpu和1个gpu。只要规模集中仍具有运行中的VM，就可以进行规模缩小和规模扩大。两种比例尺都设置为缩小到0。使用ACS，我已经用污点和自定义标签设置了这两种比例尺设置。一旦缩放比例集缩小到0，计划新的Pod时，我将无法使自动缩放器旋转回一个节点。我不确定自己做错了什么，或者不确定是否缺少一些配置，标签，异味等。我刚刚开始使用kubernetes。

下面是我的acs-engine json，pod定义以及自动定标器和pod的日志描述。

kubectl logs -n kube-system cluster-autoscaler-5967b96496-jnvjr的输出

I0920 16:11:14.925761       1 scale_up.go:249] Pod default/my-test-pod is unschedulable
I0920 16:11:14.999323       1 utils.go:196] Pod my-test-pod can't be scheduled on k8s-pool2-24760778-vmss, predicate failed: GeneralPredicates predicate mismatch, cannot put default/my-test-pod on template-node-for-k8s-pool2-24760778-vmss-6220731686255962863, reason: node(s) didn't match node selector
I0920 16:11:14.999408       1 utils.go:196] Pod my-test-pod can't be scheduled on k8s-pool3-24760778-vmss, predicate failed: GeneralPredicates predicate mismatch, cannot put default/my-test-pod on template-node-for-k8s-pool3-24760778-vmss-3043543739698957784, reason: node(s) didn't match node selector
I0920 16:11:14.999442       1 scale_up.go:376] No expansion options

kubectl describe pod my-test-pod的输出

Name:               my-test-pod
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             <none>
Annotations:        kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"my-test-pod","namespace":"default"},"spec":{"affinity":{"nodeAffinity":{"preferred...
Status:             Pending
IP:
Containers:
  my-test-pod:
    Image:      ubuntu:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -ec
      while :; do echo '.'; sleep 5; done
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qzm6s (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-qzm6s:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qzm6s
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  agentpool=pool2
                 environment=DEV
                 hardware=cpu-spec
                 node-template=k8s-pool2-24760778-vmss
                 vmSize=Standard_D4s_v3
Tolerations:     dedicated=pool2:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Warning  FailedScheduling   2m (x273 over 17m)  default-scheduler   0/3 nodes are available: 3 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  2m (x89 over 17m)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added)

acs引擎配置文件（使用Terraform渲染和生成）

{
    "apiVersion": "vlabs",
    "properties": {
      "orchestratorProfile": {
        "orchestratorType": "Kubernetes",
        "orchestratorRelease": "1.11",
        "kubernetesConfig": {
          "networkPlugin": "azure",
          "clusterSubnet": "${cidr}",
          "privateCluster": {
            "enabled": true
          },
          "addons": [
            {
              "name": "nvidia-device-plugin",
              "enabled": true
            },
            {
              "name": "cluster-autoscaler",
              "enabled": true,
              "config": {
                "minNodes": "0",
                "maxNodes": "2",
                "image": "gcr.io/google-containers/cluster-autoscaler:1.3.1"
              }
            }
          ]
        }
      },
      "masterProfile": {
        "count": ${master_vm_count},
        "dnsPrefix": "${dns_prefix}",
        "vmSize": "${master_vm_size}",
        "storageProfile": "ManagedDisks",
        "vnetSubnetId": "${pool_subnet_id}",
        "firstConsecutiveStaticIP": "${first_master_ip}",
        "vnetCidr": "${cidr}"
      },
      "agentPoolProfiles": [
        {
          "name": "pool3",
          "count": ${dedicated_vm_count},
          "vmSize": "${dedicated_vm_size}",
          "storageProfile": "ManagedDisks",
          "OSDiskSizeGB": 31,
          "vnetSubnetId": "${pool_subnet_id}",
          "customNodeLabels": {
              "vmSize":"${dedicated_vm_size}",
              "dedicatedOnly": "true",
              "environment":"${environment}",
              "hardware": "${dedicated_spec}"
          },
          "availabilityProfile": "VirtualMachineScaleSets",
          "scaleSetEvictionPolicy": "Delete",
          "kubernetesConfig": {
            "kubeletConfig": {
              "--register-with-taints": "dedicated=pool3:NoSchedule"
            }
          }
        },
        {
          "name": "pool2",
          "count": ${pool2_vm_count},
          "vmSize": "${pool2_vm_size}",
          "storageProfile": "ManagedDisks",
          "OSDiskSizeGB": 31,
          "vnetSubnetId": "${pool_subnet_id}",
          "availabilityProfile": "VirtualMachineScaleSets",
          "customNodeLabels": {
              "vmSize":"${pool2_vm_size}",
              "environment":"${environment}",
              "hardware": "${pool_spec}"
          },
          "kubernetesConfig": {
            "kubeletConfig": {
              "--register-with-taints": "dedicated=pool2:NoSchedule"
            }
          }
    },
        {
          "name": "pool1",
          "count": ${pool1_vm_count},
          "vmSize": "${pool1_vm_size}",
          "storageProfile": "ManagedDisks",
          "OSDiskSizeGB": 31,
          "vnetSubnetId": "${pool_subnet_id}",
          "availabilityProfile": "VirtualMachineScaleSets",
          "customNodeLabels": {
              "vmSize":"${pool1_vm_size}",
              "environment":"${environment}",
              "hardware": "${pool_spec}"
          }
        }
      ],
      "linuxProfile": {
        "adminUsername": "${admin_user}",
        "ssh": {
          "publicKeys": [
            {
              "keyData": "${ssh_key}"
            }
          ]
        }
      },
      "servicePrincipalProfile": {
        "clientId": "${service_principal_client_id}",
        "secret": "${service_principal_client_secret}"
      }
    }
  }

Pod配置文件

apiVersion: v1
kind: Pod
metadata:
  name: my-test-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: vmSize
            operator: In
            values:
              - Standard_D4s_v3
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: hardware
                operator: In
                values:
                - cpu-spec
  nodeSelector:
    agentpool: pool2
    hardware: cpu-spec
    vmSize: Standard_D4s_v3
    environment: DEV
    node-template: k8s-pool2-24760778-vmss
  tolerations:
    - key: dedicated
      operator: Equal
      value: pool2
      effect: NoSchedule
  containers:
    - name: my-test-pod
      image: ubuntu:latest
      command: ["/bin/bash", "-ec", "while :; do echo '.'; sleep 5; done"]
  restartPolicy: Never

我尝试了在nodeAffinity / nodeSelector / Tolerations中添加和删除它们的变体，所有这些都具有相同的结果。

集群启动后，我确实将pool2添加到自动缩放器中。在Internet上寻找解决方案时，我会不断浏览有关节点模板标签的文章，我认为形式是k8s.io/autoscaler/cluster-autoscaler/node-template/label/value，但这似乎是必需的适用于AWS。

有人可以在Azure上为我提供任何指导吗？

谢谢。

Answer 1

更新。

我已经找到答案。通过删除 requiredDuringSchedulingIgnoreDuringExecution 节点关联性规则，并仅使用 preferredDuringSchedulingIgnoreDuringExecution ，调度程序就可以在规模集中正确启动新的VM。

apiVersion: v1
kind: Pod
metadata:
  name: my-test-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: hardware
                operator: In
                values:
                - cpu-spec
  nodeSelector:
    agentpool: pool2
    hardware: cpu-spec
    vmSize: Standard_D4s_v3
    environment: DEV
    node-template: k8s-pool2-24760778-vmss
  tolerations:
    - key: dedicated
      operator: Equal
      value: pool2
      effect: NoSchedule
  containers:
    - name: my-test-pod
      image: ubuntu:latest
      command: ["/bin/bash", "-ec", "while :; do echo '.'; sleep 5; done"]
  restartPolicy: Never

在ACS-Engine

1 个答案: