Kubernetes keeps sending traffic to the Pod even after SIGTERM

Date: 2019-07-20 05:31:40

Tags: spring-boot deployment kubernetes kubernetes-pod

I have a Spring Boot project with graceful shutdown configured, deployed on Kubernetes 1.12.7. Here are the logs:

2019-07-20 10:23:16.180 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Received shutdown event
2019-07-20 10:23:16.180 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Waiting for 30s to finish
2019-07-20 10:23:16.273 INFO [service,fd964ebaa631a860,75a07c123397e4ff,false] 1 --- [io-8080-exec-10] com.jay.resource.ProductResource : GET /products?id=59
2019-07-20 10:23:16.374 INFO [service,9a569ecd8c448e98,00bc11ef2776d7fb,false] 1 --- [nio-8080-exec-1] com.jay.resource.ProductResource : GET /products?id=68
...
2019-07-20 10:23:33.711 INFO [service,1532d6298acce718,08cfb8085553b02e,false] 1 --- [nio-8080-exec-9] com.jay.resource.ProductResource : GET /products?id=209
2019-07-20 10:23:46.181 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Resumed after hibernation
2019-07-20 10:23:46.216 INFO [service,,,] 1 --- [ Thread-7] o.s.s.concurrent.ThreadPoolTaskExecutor : Shutting down ExecutorService 'applicationTaskExecutor'

The application received the SIGTERM from Kubernetes at 10:23:16.180. According to the Kubernetes documentation on pod termination (point #5), the terminating pod is supposed to be removed from the service's endpoint list, yet Kubernetes kept sending traffic to it until 10:23:33.711, well after the SIGTERM signal was sent. Is there any configuration I am missing?


Dockerfile

FROM openjdk:8-jre-slim
MAINTAINER Jay
RUN apt update && apt install -y curl libtcnative-1 gcc && apt clean
ADD build/libs/sample-service.jar /
CMD ["java", "-jar", "sample-service.jar"]

GracefulShutdown

// https://github.com/spring-projects/spring-boot/issues/4657
class GracefulShutdown(val waitTime: Long, val timeout: Long) :
        TomcatConnectorCustomizer, ApplicationListener<ContextClosedEvent> {

    @Volatile
    private var connector: Connector? = null

    override fun customize(connector: Connector) {
        this.connector = connector
    }

    override fun onApplicationEvent(event: ContextClosedEvent) {
        log.info("Received shutdown event")
        val executor = this.connector?.protocolHandler?.executor
        if (executor is ThreadPoolExecutor) {
            try {
                val threadPoolExecutor: ThreadPoolExecutor = executor
                log.info("Waiting for ${waitTime}s to finish")
                hibernate(waitTime * 1000)
                log.info("Resumed after hibernation")
                this.connector?.pause()
                threadPoolExecutor.shutdown()
                if (!threadPoolExecutor.awaitTermination(timeout, TimeUnit.SECONDS)) {
                    log.warn("Tomcat thread pool did not shut down gracefully within $timeout seconds. Proceeding with forceful shutdown")
                    threadPoolExecutor.shutdownNow()
                    if (!threadPoolExecutor.awaitTermination(timeout, TimeUnit.SECONDS)) {
                        log.error("Tomcat thread pool did not terminate")
                    }
                }
            } catch (ex: InterruptedException) {
                log.info("Interrupted")
                Thread.currentThread().interrupt()
            }
        } else {
            this.connector?.pause()
        }
    }

    private fun hibernate(time: Long) {
        try {
            Thread.sleep(time)
        } catch (ex: Exception) {
        }
    }

    companion object {
        private val log = LoggerFactory.getLogger(GracefulShutdown::class.java)
    }
}

@Configuration
class GracefulShutdownConfig(
        @Value("\${app.shutdown.graceful.wait-time:30}") val waitTime: Long,
        @Value("\${app.shutdown.graceful.timeout:30}") val timeout: Long
) {
    companion object {
        private val log = LoggerFactory.getLogger(GracefulShutdownConfig::class.java)
    }

    @Bean
    fun gracefulShutdown(): GracefulShutdown {
        return GracefulShutdown(waitTime, timeout)
    }

    @Bean
    fun webServerFactory(gracefulShutdown: GracefulShutdown): ConfigurableServletWebServerFactory {
        log.info("GracefulShutdown configured with wait: ${waitTime}s and timeout: ${timeout}s")
        val factory = TomcatServletWebServerFactory()
        factory.addConnectorCustomizers(gracefulShutdown)
        return factory
    }
}

Deployment file

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    k8s-app: service
  name: service
spec:
  progressDeadlineSeconds: 420
  replicas: 1
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      k8s-app: service
  strategy:
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        k8s-app: service
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - env:
        - name: SPRING_PROFILES_ACTIVE
          value: dev
        image: service:2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 20
          httpGet:
            path: /actuator/health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 30
          timeoutSeconds: 5
        name: service
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          failureThreshold: 60
          httpGet:
            path: /actuator/health
            port: 8080
          initialDelaySeconds: 100
          periodSeconds: 10
          timeoutSeconds: 5

Update:

Added a custom health check endpoint:

@RestControllerEndpoint(id = "live")
@Component
class LiveEndpoint {

    companion object {
        private val log = LoggerFactory.getLogger(LiveEndpoint::class.java)
    }

    @Autowired
    private lateinit var gracefulShutdownStatus: GracefulShutdownStatus

    @GetMapping
    fun live(): ResponseEntity<Any> {
        val status = if (gracefulShutdownStatus.isTerminating())
            HttpStatus.INTERNAL_SERVER_ERROR.value()
        else
            HttpStatus.OK.value()
        log.info("Status: $status")
        return ResponseEntity.status(status).build()
    }
}
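The LiveEndpoint above depends on a GracefulShutdownStatus bean that the question does not show. A minimal sketch of what it might look like, assuming it is just a thread-safe flag that GracefulShutdown sets when the shutdown event arrives (the class name and method names match the usage above, but the implementation here is guessed):

import java.util.concurrent.atomic.AtomicBoolean
import org.springframework.stereotype.Component

// Hypothetical sketch: the question does not include GracefulShutdownStatus.
// Assumed to be a simple flag bean that GracefulShutdown flips at the start of
// onApplicationEvent(), so that /actuator/live starts returning 500.
@Component
class GracefulShutdownStatus {
    private val terminating = AtomicBoolean(false)

    // Expected to be called when the shutdown event is received.
    fun markTerminating() = terminating.set(true)

    // Queried by LiveEndpoint to decide between 200 and 500.
    fun isTerminating(): Boolean = terminating.get()
}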

Changed the livenessProbe to:

  livenessProbe:
    httpGet:
      path: /actuator/live
      port: 8080
    initialDelaySeconds: 100
    periodSeconds: 5
    timeoutSeconds: 5
    failureThreshold: 3

Here are the logs after the change:

2019-07-21 14:13:01.431 INFO [service,9b65b26907f2cf8f,9b65b26907f2cf8f,false] 1 --- [nio-8080-exec-2] com.jay.util.LiveEndpoint : Status: 200
2019-07-21 14:13:01.444 INFO [service,3da259976f9c286c,64b0d5973fddd577,false] 1 --- [nio-8080-exec-3] com.jay.resource.ProductResource : GET /products?id=52
2019-07-21 14:13:01.609 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Received shutdown event
2019-07-21 14:13:01.610 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Waiting for 30s to finish
...
2019-07-21 14:13:06.431 INFO [service,002c0da2133cf3b0,002c0da2133cf3b0,false] 1 --- [nio-8080-exec-3] com.jay.util.LiveEndpoint : Status: 500
2019-07-21 14:13:06.433 INFO [service,072abbd7275103ce,d1ead06b4abf2a34,false] 1 --- [nio-8080-exec-4] com.jay.resource.ProductResource : GET /products?id=96
...
2019-07-21 14:13:11.431 INFO [service,35aa09a8aea64ae6,35aa09a8aea64ae6,false] 1 --- [io-8080-exec-10] com.jay.util.LiveEndpoint : Status: 500
2019-07-21 14:13:11.508 INFO [service,a78c924f75538a50,0314f77f21076313,false] 1 --- [nio-8080-exec-2] com.jay.resource.ProductResource : GET /products?id=110
...
2019-07-21 14:13:16.431 INFO [service,38a940dfda03956b,38a940dfda03956b,false] 1 --- [nio-8080-exec-9] com.jay.util.LiveEndpoint : Status: 500
2019-07-21 14:13:16.593 INFO [service,d76e81012934805f,b61cb062154bb7f0,false] 1 --- [io-8080-exec-10] com.jay.resource.ProductResource : GET /products?id=152
...
2019-07-21 14:13:29.634 INFO [service,38a32a20358a7cc4,2029de1ed90e9539,false] 1 --- [nio-8080-exec-6] com.jay.resource.ProductResource : GET /products?id=191
2019-07-21 14:13:31.610 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Resumed after hibernation
2019-07-21 14:13:31.692 INFO [service,,,] 1 --- [ Thread-7] o.s.s.concurrent.ThreadPoolTaskExecutor : Shutting down ExecutorService 'applicationTaskExecutor'

After 3 consecutive livenessProbe failures (i.e. from 14:13:16.431), Kubernetes kept serving traffic to the pod for another 13 seconds, until 14:13:29.634.

Update 2: Sequence of events (thanks to Eamonn McEvoy)


1 Answer:

Answer 0 (score: 2):

SIGTERM does not instantly put the pod into the terminating state. You can see in your logs that the application begins its graceful shutdown at 10:23:16.180 and takes more than 20 seconds to complete. Only at that point does the container stop and the pod can enter the terminating state.

As far as Kubernetes is concerned, the pod looks fine during the graceful shutdown. You need to add a liveness probe to your deployment; when it becomes unhealthy, traffic will stop.

livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 100
  periodSeconds: 10
  timeoutSeconds: 5

Update:

This is because your failureThreshold is 3, so you are allowing traffic for up to 15 seconds after the SIGTERM;

For example:

seconds | healthy | events
   0    |    ✔    |   * liveness probe healthy
   1    |    ✔    |   - SIGTERM
   2    |    ✔    |   
   3    |    ✔    |   
   4    |    ✔    |   
   5    |    ✔    |   * liveness probe issued
   6    |    ✔    |       .
   7    |    ✔    |       .
   8    |    ✔    |       .
   9    |    ✔    |       .
   10   |    ✔    |   * liveness probe timeout - unhealthy (1/3)
   11   |    ✔    |   
   12   |    ✔    |   
   13   |    ✔    |   
   14   |    ✔    |   
   15   |    ✔    |   * liveness probe issued
   16   |    ✔    |       .
   17   |    ✔    |       .
   18   |    ✔    |       .
   19   |    ✔    |       .
   20   |    ✔    |   * liveness probe timeout - unhealthy (2/3)
   21   |    ✔    |   
   22   |    ✔    |   
   23   |    ✔    |   
   24   |    ✔    |   
   25   |    ✔    |   * liveness probe issued
   26   |    ✔    |       .
   27   |    ✔    |       .
   28   |    ✔    |       .
   29   |    ✔    |       .
   30   |    ✘    |   * liveness probe timeout - unhealthy (3/3)
        |         |   * pod restarts

This assumes the endpoint returns an unhealthy response during the graceful shutdown. Since you have timeoutSeconds: 5, it can take even longer if the probe simply times out, because there is up to a 5-second delay between a liveness probe request being issued and its response being counted. It could also be the case that the container actually dies before the liveness threshold is hit, and you would still see the original behaviour.
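As a rough back-of-the-envelope check of the timing described above, here is a small sketch (not from the answer; it is a pessimistic estimate assuming every probe after SIGTERM fails or times out, and that the first failing probe may be issued up to one full period after SIGTERM):

// Rough worst-case estimate of how long traffic may keep flowing after SIGTERM
// before the liveness failureThreshold is reached.
fun worstCaseSecondsUntilUnhealthy(
    periodSeconds: Int,
    timeoutSeconds: Int,
    failureThreshold: Int
): Int = failureThreshold * periodSeconds + timeoutSeconds

fun main() {
    // Values from the updated livenessProbe: period 5s, timeout 5s, threshold 3.
    println(worstCaseSecondsUntilUnhealthy(5, 5, 3))   // 20
    // Values from the probe suggested in the answer: period 10s, timeout 5s,
    // failureThreshold unset, so the Kubernetes default of 3 applies.
    println(worstCaseSecondsUntilUnhealthy(10, 5, 3))  // 35
}

This lines up with the observed behaviour: the 13 seconds of extra traffic seen in the updated logs falls within the ~20-second worst case for the updated probe settings.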