Question

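I understand that ImagePullBackOff or ErrImagePull happens when K8s cannot pull a container image, but I do not think that is what is happening here. I say this because, as my service scales, the error is thrown at random by only some of the containers, while the others come up fine with an OK status.

For instance, please refer to this replica set here.

I retrieved the events from one such failed pod.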

Events:
  Type     Reason     Age                   From                                                          Message
  ----     ------     ----                  ----                                                          -------
  Normal   Scheduled  3m45s                 default-scheduler                                             Successfully assigned default/storefront-jtonline-prod-6dfbbd6bd8-jp5k5 to gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl
  Normal   Pulling    2m8s (x4 over 3m44s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  pulling image "gcr.io/square1-2019/storefront-jtonline-prod:latest"
  Warning  Failed     2m7s (x4 over 3m43s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Failed to pull image "gcr.io/square1-2019/storefront-jtonline-prod:latest": rpc error: code = Unknown desc = Error response from daemon: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
  Warning  Failed     2m7s (x4 over 3m43s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Error: ErrImagePull
  Normal   BackOff    113s (x6 over 3m42s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Back-off pulling image "gcr.io/square1-2019/storefront-jtonline-prod:latest"
  Warning  Failed     99s (x7 over 3m42s)   kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Error: ImagePullBackOff

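The events tell me the image could not be pulled because of incorrect credentials, which seems confusing. This pod was created automatically during auto-scaling, exactly like the others.

I have a feeling this might have to do with resourcing. I see a much higher rate of these errors when the cluster spins up new nodes very quickly because of a traffic spike, or when I set lower resource requests in my deployment configuration.

How do I go about debugging this error, and what could be causing it?

Here is my configuration: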

apiVersion: "extensions/v1beta1"
kind: "Deployment"
metadata:
  name: "storefront-_STOREFRONT-_ENV"
  namespace: "default"
  labels:
    app: "storefront-_STOREFRONT-_ENV"
spec:
  replicas: 10
  selector:
    matchLabels:
      app: "storefront-_STOREFRONT-_ENV"
  template:
    metadata:
      labels:
        app: "storefront-_STOREFRONT-_ENV"
    spec:
      containers:
      - name: "storefront-_STOREFRONT-_ENV"
        image: "gcr.io/square1-2019/storefront-_STOREFRONT-_ENV"
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /?healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 1
        imagePullPolicy: Always

apiVersion: "autoscaling/v2beta1"
kind: "HorizontalPodAutoscaler"
metadata:
  name: "storefront-_STOREFRONT-hpa"
  namespace: "default"
  labels:
    app: "storefront-_STOREFRONT-_ENV"
spec:
  scaleTargetRef:
    kind: "Deployment"
    name: "storefront-_STOREFRONT-_ENV"
    apiVersion: "apps/v1beta1"
  minReplicas: 10
  maxReplicas: 1000
  metrics:
  - type: "Resource"
    resource:
      name: "cpu"
      targetAverageUtilization: 75

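EDIT: I have been able to verify that this is indeed an auth issue. It only happens for "some" containers because it only happens for containers scheduled on nodes that were created automatically due to vertical scaling. I do not know how to fix this yet.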

Answer 1

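As we can read in the Kubernetes docs regarding images, there is nothing to do if you are running your cluster on GKE.

Note: If you are running on Google Kubernetes Engine, there will already be a .dockercfg on each node with credentials for Google Container Registry. You cannot use this approach.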

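But it also states:

Note: This approach is suitable if you can control node configuration. It will not work reliably on GCE, or on any other cloud provider that does automatic node replacement.

Also, in the section Specifying ImagePullSecrets on a Pod:

Note: This approach is currently the recommended approach for Google Kubernetes Engine, GCE, and any cloud provider where node creation is automated.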

The recommendation is to create a Secret with a Docker config.

This can be done as follows:

kubectl create secret docker-registry <name> --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
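
Once the Secret exists, the Pod spec has to reference it through imagePullSecrets, otherwise the kubelet will not use it. The fragment below is a minimal sketch of how that could look in the pod template of the Deployment from the question. The Secret name "gcr-pull-secret" is a hypothetical placeholder standing in for the <name> passed to the command above, and the Secret is assumed to live in the same namespace as the Pods ("default" in the question).

```yaml
# Sketch only: the relevant part of the Deployment's pod template.
# "gcr-pull-secret" is a hypothetical name; replace it with the <name>
# used in `kubectl create secret docker-registry <name> ...`.
spec:
  template:
    spec:
      imagePullSecrets:
      - name: "gcr-pull-secret"   # must exist in the Pod's namespace
      containers:
      - name: "storefront-_STOREFRONT-_ENV"
        image: "gcr.io/square1-2019/storefront-_STOREFRONT-_ENV"
        imagePullPolicy: Always
```

With this in place, pods scheduled on newly created nodes would authenticate with the Secret instead of relying on credentials present on the node.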

Kubernetes Pods occasionally throw "ImagePullBackOff" or "ErrImagePull"
