Question

我使用Datadog Helm chart部署了Datadog代理，该代理在Kubernetes中部署了Daemonset。但是，当检查Daemonset的状态时，我看到它没有创建所有Pod：

NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
datadog-agent-datadog   5         2         2       2            2           <none>          1h

在描述Daemonset以找出问题所在时，我发现它没有足够的资源：

Events:
  Type     Reason            Age                From                  Message
  ----     ------            ----               ----                  -------
  Warning  FailedPlacement   42s (x6 over 42s)  daemonset-controller  failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
  Warning  FailedPlacement   42s (x6 over 42s)  daemonset-controller  failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
  Warning  FailedPlacement   42s (x5 over 42s)  daemonset-controller  failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
  Warning  FailedPlacement   42s (x7 over 42s)  daemonset-controller  failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
  Normal   SuccessfulCreate  42s                daemonset-controller  Created pod: datadog-agent-7b2kp

但是，我在群集中安装了Cluster-autoscaler并进行了正确配置（它会在没有足够资源进行调度的常规Pod部署中触发），但似乎并不会在Daemonset：

I0424 14:14:48.545689       1 static_autoscaler.go:273] No schedulable pods
I0424 14:14:48.545700       1 static_autoscaler.go:280] No unschedulable pods

AutoScalingGroup还剩下足够的节点：

我是否错过了集群自动缩放器的配置？我该怎么做才能确保也触发Daemonset资源？

编辑：描述守护进程

Name:           datadog-agent
Selector:       app=datadog-agent
Node-Selector:  <none>
Labels:         app=datadog-agent
                chart=datadog-1.27.2
                heritage=Tiller
                release=datadog-agent
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 2
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 2
Number of Nodes Misscheduled: 0
Pods Status:  2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=datadog-agent
  Annotations:      checksum/autoconf-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
                    checksum/checksd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
                    checksum/confd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
  Service Account:  datadog-agent
  Containers:
   datadog:
    Image:      datadog/agent:6.10.1
    Port:       8125/UDP
    Host Port:  0/UDP
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi
    Liveness:  http-get http://:5555/health delay=15s timeout=5s period=15s #success=1 #failure=6
    Environment:
      DD_API_KEY:                  <set to the key 'api-key' in secret 'datadog-secret'>  Optional: false
      DD_LOG_LEVEL:                INFO
      KUBERNETES:                  yes
      DD_KUBERNETES_KUBELET_HOST:   (v1:status.hostIP)
      DD_HEALTH_PORT:              5555
    Mounts:
      /host/proc from procdir (ro)
      /host/sys/fs/cgroup from cgroups (ro)
      /var/run/docker.sock from runtimesocket (ro)
      /var/run/s6 from s6-run (rw)
  Volumes:
   runtimesocket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  
   procdir:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  
   cgroups:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:  
   s6-run:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
Events:
  Type     Reason            Age                 From                  Message
  ----     ------            ----                ----                  -------
  Warning  FailedPlacement   33m (x6 over 33m)   daemonset-controller  failed to place pod on "ip-10-0-2-144.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
  Normal   SuccessfulCreate  33m                 daemonset-controller  Created pod: datadog-agent-7b2kp
  Warning  FailedPlacement   16m (x25 over 33m)  daemonset-controller  failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
  Warning  FailedPlacement   16m (x25 over 33m)  daemonset-controller  failed to place pod on "ip-10-0-2-174.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
  Warning  FailedPlacement   16m (x25 over 33m)  daemonset-controller  failed to place pod on "ip-10-0-3-250.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000

Answer 1

您可以添加priorityClassName以指向DaemonSet的高优先级PriorityClass。然后，Kubernetes将删除其他Pod以运行DaemonSet的Pod。如果导致无法调度的Pod，则cluster-autoscaler应该添加一个节点以对其进行调度。

请参见the docs（基于此的大多数示例）（对于某些1.14之前的版本，apiVersion可能是beta版（1.11-1.13）或alpha版（1.8-1.10））

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority class for essential pods"

将其应用于您的工作负荷

---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: datadog-agent
spec:
  template:
    metadata:
      labels:
        app: datadog-agent
      name: datadog-agent
    spec:
      priorityClassName: high-priority
      serviceAccountName: datadog-agent
      containers:
      - image: datadog/agent:latest
############ Rest of template goes here

Answer 2

您应该了解群集自动缩放器如何工作。仅负责添加或删除节点。它不负责创建或销毁吊舱。因此，在您的情况下，群集自动缩放器不会执行任何操作，因为它没有用。即使您再添加一个节点-仍然需要在CPU不足的节点上运行DaemonSet Pod。这就是为什么它不添加节点。

您应该做的是从占用的节点上手动删除一些吊舱。然后它将能够安排DaemonSet吊舱。

或者，您可以将Datadog的CPU请求减少到例如100m或50m。这足以启动这些吊舱。

群集自动缩放器不会在Daemonset部署上触发放大

2 个答案: