Question

I have a Next.js application with two simple endpoints, readiness and liveness, implemented as follows:

return res.status(200).send('OK');

I created the endpoints following the API routes docs. Also, per the docs here, I have a /stats basePath. So the probe endpoints are at /stats/api/readiness and /stats/api/liveness.

When I build and run the application locally in a Docker container, the probe endpoints are reachable and return 200 OK.

When I deploy the application to my k8s cluster, the probes fail. initialDelaySeconds leaves plenty of time, so that is not the cause.

I port-forward to the pod through its service. When the pod has just started, before the probes fail, I can reach the endpoint and it returns 200 OK. Then it starts failing as usual.

I also tried reaching the failing pod from a healthy pod:

k exec -t [healthy pod name] -- curl -l 10.133.2.35:8080/stats/api/readiness

Same result: at first, while the pod has not failed yet, the curl command returns 200 OK. After a while, it starts failing.

The probe error I get is:

Readiness probe failed: Get http://10.133.2.35:8080/stats/api/readiness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

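An interesting experiment: I pointed the probe at a random, nonexistent endpoint and got the same error. That makes me think the probe fails because it cannot reach the right endpoint?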

But then again, the endpoint is reachable for a while before the probes start failing. So I really don't know why this is happening.

Here is the k8s deployment config for my probes:

      livenessProbe:
        httpGet:
          path: /stats/api/liveness
          port: 8080
          scheme: HTTP
        initialDelaySeconds: 10
        timeoutSeconds: 3
        periodSeconds: 3
        successThreshold: 1
        failureThreshold: 5
      readinessProbe:
        httpGet:
          path: /stats/api/readiness
          port: 8080
          scheme: HTTP
        initialDelaySeconds: 10
        timeoutSeconds: 3
        periodSeconds: 3
        successThreshold: 1
        failureThreshold: 3
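
For a rough sense of how long this config tolerates failing probes, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not the kubelet's exact scheduling: it assumes every probe fails by timing out and ignores start-up jitter. The function name `secondsUntilProbeGivesUp` is hypothetical.

```javascript
// Rough estimate (assumption: every probe times out, no kubelet jitter):
// first probe after initialDelaySeconds, then one probe per period until
// failureThreshold consecutive failures, with the last one taking timeoutSeconds.
function secondsUntilProbeGivesUp(probe) {
  const { initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold } = probe;
  return initialDelaySeconds + (failureThreshold - 1) * periodSeconds + timeoutSeconds;
}

// Values copied from the probe config above.
const liveness = { initialDelaySeconds: 10, periodSeconds: 3, timeoutSeconds: 3, failureThreshold: 5 };
const readiness = { initialDelaySeconds: 10, periodSeconds: 3, timeoutSeconds: 3, failureThreshold: 3 };

const livenessBudget = secondsUntilProbeGivesUp(liveness);   // 10 + 4*3 + 3 = 25
const readinessBudget = secondsUntilProbeGivesUp(readiness); // 10 + 2*3 + 3 = 19
```

Under these assumptions the pod is marked unready after roughly 19 seconds and the container is restarted after roughly 25 seconds of failing probes.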

Update

As requested in the comments, I ran curl -v. The result is:

*   Trying 10.133.0.12:8080...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 10.133.0.12 (10.133.0.12) port 8080 (#0)
> GET /stats/api/healthz HTTP/1.1
> Host: 10.133.0.12:8080
> User-Agent: curl/7.76.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< ETag: "2-nOO9QiTIwXgNtWtBJezz8kv3SLc"
< Content-Length: 2
< Date: Wed, 16 Jun 2021 18:42:23 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
{ [2 bytes data]
100     2  100     2    0     0    666      0 --:--:-- --:--:-- --:--:--   666
* Connection #0 to host 10.133.0.12 left intact
OK%

Of course, once it starts failing, the result is:

*   Trying 10.133.0.12:8080...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* connect to 10.133.0.12 port 8080 failed: Connection refused
* Failed to connect to 10.133.0.12 port 8080: Connection refused
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Closing connection 0
curl: (7) Failed to connect to 10.133.0.12 port 8080: Connection refused
command terminated with exit code 7

Answer 1

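The error tells you: Client.Timeout exceeded while awaiting headers. This means the TCP connection was established (neither refused nor timed out), but no response headers arrived in time.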

Your liveness/readiness probe timeout is too short. Your application does not have enough time to respond.
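
To see the difference between a timeout and a refused connection from inside the cluster, something like the following Node.js sketch could help. It only approximates what the kubelet does: Node's `timeout` option is a socket idle timeout rather than a total request deadline. The names `probe`, `classifyProbeError`, and the `PROBE_TIMEOUT` code are made up for this example.

```javascript
const http = require("http");

// Map a request error to the kind of failure a probe would report.
function classifyProbeError(err) {
  if (err.code === "ECONNREFUSED") return "refused";  // nothing is listening on the port
  if (err.code === "PROBE_TIMEOUT") return "timeout"; // our own marker, set in probe() below
  return "other";
}

// Issue one GET and resolve with "ok", "bad-status", "timeout", "refused" or "other".
function probe(url, timeoutMs) {
  return new Promise((resolve) => {
    const req = http.get(url, { timeout: timeoutMs }, (res) => {
      res.resume(); // drain the body so the socket is released
      // HTTP probes treat any status from 200 up to (not including) 400 as success.
      const ok = res.statusCode >= 200 && res.statusCode < 400;
      resolve({ result: ok ? "ok" : "bad-status", status: res.statusCode });
    });
    // Node only emits "timeout"; the request has to be aborted explicitly.
    req.on("timeout", () => {
      const err = new Error("no response within " + timeoutMs + " ms");
      err.code = "PROBE_TIMEOUT";
      req.destroy(err);
    });
    req.on("error", (err) => resolve({ result: classifyProbeError(err) }));
  });
}
```

For example, `probe("http://10.133.2.35:8080/stats/api/readiness", 3000).then(console.log)` would print a "timeout" result when the app accepts the connection but answers too slowly, and a "refused" result when nothing is listening on the port.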

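This could be because the CPU or memory allocation is smaller than on your laptop, because of higher concurrency, or because a LimitRange sets default limits that you did not set yourself.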

Check with:

time kubectl exec -t [healthy pod name] -- curl -l 127.0.0.1:8080/stats/api/readiness

If you cannot allocate more CPU, double that time, round it up, and fix your probes:

  livenessProbe:
    ...
    timeoutSeconds: 10

  readinessProbe:
    ...
    timeoutSeconds: 10
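
The "double it and round up" rule can be written down as a one-liner. This is only a sketch of the rule of thumb above; `suggestTimeoutSeconds` is a hypothetical helper, and the floor of 1 reflects the minimum value Kubernetes accepts for timeoutSeconds.

```javascript
// Double the measured response time (in seconds) and round up to a whole second.
// timeoutSeconds cannot be lower than 1, so never suggest less than that.
function suggestTimeoutSeconds(measuredSeconds) {
  return Math.max(1, Math.ceil(measuredSeconds * 2));
}
```

If the `time` command above reports about 4.2 seconds, this gives 9; a measured 5 seconds gives the 10 used in the config above.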

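Alternatively, though it is probably less in the spirit of a health check, you could replace those httpGet checks with tcpSocket ones. They would be faster, but they may miss actual problems.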

k8s readiness and liveness probes fail even though the endpoints are working
