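Kubernetes job keeps spinning up pods which end up with the "Error" status

Time: 2020-06-18 05:47:08

Tags: kubernetes

I'm working on a Kubernetes cron job that runs an integration test; the Go test binary is compiled with go test -c and copied into the Docker container that the cron job runs. The Kubernetes YAML starts out similar to the following: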

apiVersion: batch/v1beta1
kind: CronJob
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never

At some point the integration test started failing (exiting with code 1). I can see that the job's duration is the same as its age:

$ kubectl get jobs -l app=integration-test
NAME                          COMPLETIONS   DURATION   AGE
integration-test-1592457300   0/1           7m20s      7m20s

The kubectl get pods command shows pods being created more often than the once every 15 minutes I would expect from the cron schedule:

$ kubectl get pods -l app=integration-test
NAME                                READY   STATUS   RESTARTS   AGE
integration-test-1592457300-224x8   0/1     Error    0          92s
integration-test-1592457300-5f8sz   0/1     Error    0          7m33s
integration-test-1592457300-9zvjq   0/1     Error    0          3m57s
integration-test-1592457300-th7sf   0/1     Error    0          6m26s
integration-test-1592457300-vhbr2   0/1     Error    0          5m17s

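This behavior of spinning up new pods is problematic because it adds to the number of pods on the node; essentially, it consumes resources.

How can I make the cron job stop spinning up new pods, so that it runs only once every 15 minutes and does not keep consuming resources when the job fails?

Update

As a simplified example, here is Kubernetes YAML adapted from https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/: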

$ cat cronjob.yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; exit 1
          restartPolicy: Never

Note that it exits with code 1. If I run this with kubectl apply -f cronjob.yaml and then check the pods, I see

$ kubectl get pods
NAME                                                    READY   STATUS      RESTARTS   AGE
hello-1592459760-fnvcw                                  0/1     Error       0          30s
hello-1592459760-w75lt                                  0/1     Error       0          31s
hello-1592459760-xzhwn                                  0/1     Error       0          20s

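The pods' ages are less than a minute apart; in other words, pods are being spun up before the cron interval has elapsed. How can I prevent this?

2 answers:

Answer 0: (score: 3)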

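This is a very specific scenario, and it is hard to guess what you want to achieve and whether this will work for you.

concurrencyPolicy: Forbid prevents another job from being created while the previous one has not completed. But I don't think that is the issue here.

restartPolicy applies to the pod (although in a Job template you can only use OnFailure and Never). If you set restartPolicy to Never, the job will automatically create new pods until it completes.

From the Kubernetes documentation: "A Job creates one or more Pods and ensures that a specified number of them successfully terminate. As pods successfully complete, the Job tracks the successful completions."

If you set restartPolicy: Never, the job keeps creating pods until it reaches backoffLimit. Those pods remain visible in the cluster with the Error status, because each pod exits with status 1, and you would need to remove them manually. If you set restartPolicy: OnFailure, the job restarts the same pod and does not create more.

But there is another way. What counts as a completed job?

Examples:

1. restartPolicy: OnFailure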

$ kubectl get po,jobs,cronjob
NAME                         READY   STATUS             RESTARTS   AGE
pod/hello-1592495280-w27mt   0/1     CrashLoopBackOff   5          5m21s
pod/hello-1592495340-tzc64   0/1     CrashLoopBackOff   5          4m21s
pod/hello-1592495400-w8cm6   0/1     CrashLoopBackOff   5          3m21s
pod/hello-1592495460-jjlx5   0/1     CrashLoopBackOff   4          2m21s
pod/hello-1592495520-c59tm   0/1     CrashLoopBackOff   3          80s
pod/hello-1592495580-rrdzw   0/1     Error              2          20s
NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592495220   0/1           6m22s      6m22s
job.batch/hello-1592495280   0/1           5m22s      5m22s
job.batch/hello-1592495340   0/1           4m22s      4m22s
job.batch/hello-1592495400   0/1           3m22s      3m22s
job.batch/hello-1592495460   0/1           2m22s      2m22s
job.batch/hello-1592495520   0/1           81s        81s
job.batch/hello-1592495580   0/1           21s        21s
NAME                  SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/hello   */1 * * * *   False     6        25s             15m

Each job creates only 1 pod, which is restarted until the job is finished or is considered completed by the CronJob.

If you describe the CronJob, you can see this in the Events section:

Events:
  Type    Reason            Age   From                Message
  ----    ------            ----  ----                -------
  Normal  SuccessfulCreate  18m   cronjob-controller  Created job hello-1592494740
  Normal  SuccessfulCreate  17m   cronjob-controller  Created job hello-1592494800
  Normal  SuccessfulCreate  16m   cronjob-controller  Created job hello-1592494860
  Normal  SuccessfulCreate  15m   cronjob-controller  Created job hello-1592494920
  Normal  SuccessfulCreate  14m   cronjob-controller  Created job hello-1592494980
  Normal  SuccessfulCreate  13m   cronjob-controller  Created job hello-1592495040
  Normal  SawCompletedJob   12m   cronjob-controller  Saw completed job: hello-1592494740
  Normal  SuccessfulCreate  12m   cronjob-controller  Created job hello-1592495100
  Normal  SawCompletedJob   11m   cronjob-controller  Saw completed job: hello-1592494800
  Normal  SuccessfulDelete  11m   cronjob-controller  Deleted job hello-1592494740
  Normal  SuccessfulCreate  11m   cronjob-controller  Created job hello-1592495160
  Normal  SawCompletedJob   10m   cronjob-controller  Saw completed job: hello-1592494860

Why was job hello-1592494740 considered completed? The job's .spec.backoffLimit defaults to 6 (this is stated in the docs). Once the job has failed 6 times (its pod has been restarted 6 times without succeeding), the job is finished, in the failed state; the CronJob controller reports it as a completed job and later removes it. When the job is deleted, its pod is deleted as well.

However, in your example the pod was created, ran the date and echo commands, and then exited with code 1. Even though the pod is crashing, it still wrote its output. Since the last command is exit 1, it will keep crashing until the limit is reached, as in the example below:

$ kubectl get pods
NAME                     READY   STATUS             RESTARTS   AGE
hello-1592495400-w8cm6   0/1     Terminating        6          5m51s
hello-1592495460-jjlx5   0/1     CrashLoopBackOff   5          4m51s
hello-1592495520-c59tm   0/1     CrashLoopBackOff   5          3m50s
hello-1592495580-rrdzw   0/1     CrashLoopBackOff   4          2m50s
hello-1592495640-nbq59   0/1     CrashLoopBackOff   4          110s
hello-1592495700-p6pcx   0/1     Error              3          50s
user@cloudshell:~ (project)$ kubectl logs hello-1592495520-c59tm
Thu Jun 18 15:55:13 UTC 2020
Hello from the Kubernetes cluster

2. restartPolicy: Never and backoffLimit: 0

The following YAML was used:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; exit 1
          restartPolicy: Never
      backoffLimit: 0

Output:

$ kubectl get po,jobs,cronjob
NAME                         READY   STATUS   RESTARTS   AGE
pod/hello-1592497320-svd6k   0/1     Error    0          44s
NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592497320   0/1           44s        44s
NAME                  SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/hello   */1 * * * *   False     0        51s             11m

$ kubectl describe cronjob
...
Events:
  Type    Reason            Age    From                Message
  ----    ------            ----   ----                -------
  Normal  SuccessfulCreate  12m    cronjob-controller  Created job hello-1592496720
  Normal  SawCompletedJob   11m    cronjob-controller  Saw completed job: hello-1592496720
  Normal  SuccessfulCreate  11m    cronjob-controller  Created job hello-1592496780
  Normal  SawCompletedJob   10m    cronjob-controller  Saw completed job: hello-1592496780
  Normal  SuccessfulDelete  10m    cronjob-controller  Deleted job hello-1592496720
  Normal  SuccessfulCreate  10m    cronjob-controller  Created job hello-1592496840
  Normal  SuccessfulDelete  9m55s  cronjob-controller  Deleted job hello-1592496780
  Normal  SawCompletedJob   9m55s  cronjob-controller  Saw completed job: hello-1592496840
  Normal  SuccessfulCreate  9m5s   cronjob-controller  Created job hello-1592496900
  Normal  SawCompletedJob   8m55s  cronjob-controller  Saw completed job: hello-1592496900
  Normal  SuccessfulDelete  8m55s  cronjob-controller  Deleted job hello-1592496840
  Normal  SuccessfulCreate  8m5s   cronjob-controller  Created job hello-1592496960
  Normal  SawCompletedJob   7m55s  cronjob-controller  Saw completed job: hello-1592496960
  Normal  SuccessfulDelete  7m55s  cronjob-controller  Deleted job hello-1592496900
  Normal  SuccessfulCreate  7m4s   cronjob-controller  Created job hello-1592497020

This way, only one job and one pod can run at the same time (there may be a gap of about 10 seconds during which 2 jobs and 2 pods exist).
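Applied to the CronJob from the question, the Never plus backoffLimit: 0 approach might look roughly like the sketch below. This is only an illustration under assumptions: the metadata name, the container name and the image are placeholders that do not appear in the question, and the app label is guessed from the -l app=integration-test selector used there. The only field added to what the question shows is backoffLimit, which belongs to the job spec rather than the pod spec. The thread uses batch/v1beta1; newer clusters serve CronJob under batch/v1.

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: integration-test              # placeholder name, not from the question
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      backoffLimit: 0                  # no retries: a failed pod is not replaced within the same run
      template:
        metadata:
          labels:
            app: integration-test      # assumed from the label selector in the question
        spec:
          restartPolicy: Never
          containers:
          - name: integration-test     # placeholder container name
            image: example.com/integration-test:latest   # hypothetical image
```

With failedJobsHistoryLimit: 7, up to 7 failed jobs and their pods in the Error status are still retained, so lowering that value is one way to keep fewer leftovers around.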

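I hope this clears things up a bit. If you would like a more precise answer, please provide more information about your scenario.

Answer 1: (score: 0)

The default concurrencyPolicy is Allow.

You can set concurrencyPolicy: Forbid to avoid running new jobs in parallel.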

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  # Allow | Forbid | Replace
  concurrencyPolicy: Forbid
  jobTemplate: