ML model pod keeps restarting in a Seldon deployment

Time: 2020-06-25 07:09:40

Tags: kubernetes mlflow seldon

I have a Seldon deployment like this:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: gs://seldon-models/mlflow/elasticnet_wine
        name: classifier
      name: default
      replicas: 1     

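The model is downloaded from the server successfully, but after a while the pod goes into a crash loop and restarts again and again.

When I look at the logs there is no error, because the logs start over with each restart; all I can see is the Python packages being downloaded.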

PS C:\Users\xxx\mlflow> kubectl logs -p -c wines-classifier model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Executing before-run script
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages
_libgcc_mutex-0.1    | 3 KB      | ########## | 100%
readline-7.0         | 324 KB    | ########## | 100%
ncurses-6.2          | 817 KB    | ########## | 100%
tbb4py-2020.0        | 209 KB    | ########## | 100%
scipy-1.1.0          | 13.2 MB   | ########## | 100%
zlib-1.2.11          | 103 KB    | ########## | 100%
xz-5.2.5             | 341 KB    | ########## | 100%
openssl-1.1.1g       | 2.5 MB    | ########## | 100%
mkl_fft-1.0.6        | 135 KB    | ########## | 100%
blas-1.0             | 6 KB      | ########## | 100%
pip-20.1.1           | 1.8 MB    | ########## | 100%
wheel-0.34.2         | 51 KB     | ########## | 100%
libffi-3.2.1         | 40 KB     | ########## | 100%
scikit-learn-0.19.1  | 3.9 MB    | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB   | ########## | 100%
sqlite-3.32.3        | 1.1 MB    | ########## | 100%
numpy-1.15.4         | 34 KB     | ########## | 100%
tk-8.6.10            | 3.0 MB    | ########## | 100%
libgcc-ng-9.1.0      | 5.1 MB    | ########## | 100%
setuptools-47.3.1    | 514 KB    | ########## | 100%
mkl_random-1.0.1     | 324 KB    | ########## | 100%
python-3.6.9         | 30.2 MB   | ########## | 100%
certifi-2020.6.20    | 156 KB    | ########## | 100%
numpy-base-1.15.4    | 3.4 MB    | ########## | 100%
intel-openmp-2019.4  | 729 KB    | ########## | 100%
libedit-3.1.20191231 | 167 KB    | ########## | 100%
libstdcxx-ng-9.1.0   | 3.1 MB    | ########## | 100%
tbb-2020.0           | 1.1 MB    | ########## | 100%
mkl-2018.0.3         | 126.9 MB  | #########  |  91%

Now, trying the -p flag as suggested by @arghya-sadhu:

PS C:\Users\xxx\mlflow> kubectl logs -p model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp wines-classifier
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages
scikit-learn-0.19.1  | 3.9 MB    | ########## | 100%
ncurses-6.2          | 817 KB    | ########## | 100%
_libgcc_mutex-0.1    | 3 KB      | ########## | 100%
zlib-1.2.11          | 103 KB    | ########## | 100%
tbb4py-2020.0        | 209 KB    | ########## | 100%
setuptools-47.3.1    | 514 KB    | ########## | 100%
libedit-3.1.20191231 | 167 KB    | ########## | 100%
tbb-2020.0           | 1.1 MB    | ########## | 100%
xz-5.2.5             | 341 KB    | ########## | 100%
mkl_random-1.0.1     | 324 KB    | ########## | 100%
libgcc-ng-9.1.0      | 5.1 MB    | ########## | 100%
python-3.6.9         | 30.2 MB   | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB   | ########## | 100%
libffi-3.2.1         | 40 KB     | ########## | 100%
mkl-2018.0.3         | 126.9 MB  | ########## | 100%
libstdcxx-ng-9.1.0   | 3.1 MB    | ########## | 100%
readline-7.0         | 324 KB    | ########## | 100%
intel-openmp-2019.4  | 729 KB    | ########## | 100%
tk-8.6.10            | 3.0 MB    | ########## | 100%
pip-20.1.1           | 1.8 MB    | ########## | 100%
numpy-base-1.15.4    | 3.4 MB    | ########## | 100%
wheel-0.34.2         | 51 KB     | ########## | 100%
scipy-1.1.0          | 13.2 MB   | #########3 |  93%

And the description of the pod:

PS C:\Users\ivarea\repo\smartgraph\mlflow-v2> kubectl describe pod model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Name:         model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Namespace:    default
Priority:     0
Node:         mlops-control-plane/172.19.0.2
Start Time:   Thu, 25 Jun 2020 10:08:20 +0200
Labels:       app=model-a-wines-classifier-0-wines-classifier
              fluentd=true
              pod-template-hash=5b8bc7889d
              seldon-app=model-a-wines-classifier
              seldon-app-svc=model-a-wines-classifier-wines-classifier
              seldon-deployment-id=model-a
              version=wines-classifier
Annotations:  prometheus.io/path: /prometheus
              prometheus.io/scrape: true
Status:       Running
IP:           10.244.0.17
IPs:
  IP:           10.244.0.17
Controlled By:  ReplicaSet/model-a-wines-classifier-0-wines-classifier-5b8bc7889d
Init Containers:
  wines-classifier-model-initializer:
    Container ID:  containerd://6a3b158cf4218f8c177f6d18eb5d0387946bf9cc36f1173754b68a029483da8b
    Image:         gcr.io/kfserving/storage-initializer:0.2.2
    Image ID:      gcr.io/kfserving/storage-initializer@sha256:7a7d3cf4c5121a3e6bad0acc9e88bbdfa9c7f774d80bd64d8e35a84dcfef8890
    Port:          <none>
    Host Port:     <none>
    Args:
      gs://seldon-models/mlflow/model-a
      /mnt/models
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 25 Jun 2020 10:08:24 +0200
      Finished:     Thu, 25 Jun 2020 10:08:47 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /mnt/models from wines-classifier-provision-location (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Containers:
  wines-classifier:
    Container ID:   containerd://536753d25877994a17d1f1a63bbaf8717dc9180b80f061152688e4c8504c8468
    Image:          seldonio/mlflowserver_rest:0.5
    Image ID:       docker.io/seldonio/mlflowserver_rest@sha256:0fd54a0a314fafc82c490c91df0c4776be454702a307b4b76e12ed6958b4ee00
    Ports:          6000/TCP, 9000/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 25 Jun 2020 10:23:28 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 25 Jun 2020 10:19:09 +0200
      Finished:     Thu, 25 Jun 2020 10:20:41 +0200
    Ready:          False
    Restart Count:  7
    Liveness:       tcp-socket :http delay=60s timeout=1s period=5s #success=1 #failure=3
    Readiness:      tcp-socket :http delay=20s timeout=1s period=5s #success=1 #failure=3
    Environment:
      PREDICTIVE_UNIT_SERVICE_PORT:          9000
      PREDICTIVE_UNIT_ID:                    wines-classifier
      PREDICTIVE_UNIT_IMAGE:                 seldonio/mlflowserver_rest:0.5
      PREDICTOR_ID:                          wines-classifier
      PREDICTOR_LABELS:                      {"version":"wines-classifier"}
      SELDON_DEPLOYMENT_ID:                  model-a
      PREDICTIVE_UNIT_METRICS_SERVICE_PORT:  6000
      PREDICTIVE_UNIT_METRICS_ENDPOINT:      /prometheus
      PREDICTIVE_UNIT_PARAMETERS:            [{"name":"model_uri","value":"/mnt/models","type":"STRING"}]
    Mounts:
      /etc/podinfo from podinfo (rw)
      /mnt/models from wines-classifier-provision-location (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
  seldon-container-engine:
    Container ID:  containerd://938e8f7e3ac23355c8a7a475b71ab54b858aff5ca485f26b99feaba09bb60069
    Image:         docker.io/seldonio/seldon-core-executor:1.1.0
    Image ID:      docker.io/seldonio/seldon-core-executor@sha256:661173fcbc6cb4e9b56db353b19e97d04d9c086e9dc445217f84dc1721bdf894
    Ports:         8000/TCP, 8000/TCP, 5001/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      --sdep
      model-a
      --namespace
      default
      --predictor
      wines-classifier
      --http_port
      8000
      --grpc_port
      5001
      --transport
      rest
      --protocol
      seldon
      --prometheus_path
      /prometheus
    State:          Running
      Started:      Thu, 25 Jun 2020 10:08:51 +0200
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      100m
    Liveness:   http-get http://:8000/live delay=20s timeout=60s period=5s #success=1 #failure=3
    Readiness:  http-get http://:8000/ready delay=20s timeout=60s period=5s #success=1 #failure=3
    Environment:
      ENGINE_PREDICTOR:  <binary ommited>
      REQUEST_LOGGER_DEFAULT_ENDPOINT_PREFIX:  http://default-broker.
      SELDON_LOG_MESSAGES_EXTERNALLY:          false
    Mounts:
      /etc/podinfo from podinfo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  wines-classifier-provision-location:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-6vqwk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6vqwk
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                  From                          Message
  ----     ------     ----                 ----                          -------
  Normal   Scheduled  <unknown>            default-scheduler             Successfully assigned default/model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp to mlops-control-plane
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "gcr.io/kfserving/storage-initializer:0.2.2" already present on machine
  Normal   Created    15m                  kubelet, mlops-control-plane  Created container wines-classifier-model-initializer
  Normal   Started    15m                  kubelet, mlops-control-plane  Started container wines-classifier-model-initializer
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "seldonio/mlflowserver_rest:0.5" already present on machine
  Normal   Created    15m                  kubelet, mlops-control-plane  Created container wines-classifier
  Normal   Started    15m                  kubelet, mlops-control-plane  Started container wines-classifier
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "docker.io/seldonio/seldon-core-executor:1.1.0" already present on machine
  Normal   Created    14m                  kubelet, mlops-control-plane  Created container seldon-container-engine
  Normal   Started    14m                  kubelet, mlops-control-plane  Started container seldon-container-engine
  Warning  Unhealthy  14m (x8 over 14m)    kubelet, mlops-control-plane  Readiness probe failed: dial tcp 10.244.0.17:9000: connect: connection refused
  Warning  Unhealthy  28s (x171 over 14m)  kubelet, mlops-control-plane  Readiness probe failed: HTTP probe failed with statuscode: 503

How can I disable the restarts so that I can inspect the logs and see the actual error?

2 answers:

Answer 0: (score: 1)

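The default liveness and readiness probe timeouts are probably too short for the classifier container to finish installing its dependencies. Kubernetes restarts the container before it has finished starting up, because it fails the liveness/readiness probes.

In my case, I had to add the following to the Seldon deployment manifest to increase the timeouts (you can of course adjust the values):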

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: ...
spec:
  name: ...
  predictors:
    - graph:
        ...
      name: ...
      replicas: ...
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                readinessProbe:
                  failureThreshold: 10
                  initialDelaySeconds: 120
                  periodSeconds: 30
                  successThreshold: 1
                  tcpSocket:
                    port: 9000
                  timeoutSeconds: 3
                livenessProbe:
                  failureThreshold: 10
                  initialDelaySeconds: 120
                  periodSeconds: 30
                  successThreshold: 1
                  tcpSocket:
                    port: 9000
                  timeoutSeconds: 3
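
To see why the default probes may be too tight, here is a rough back-of-the-envelope sketch in Python. It assumes a simplified model in which the container is restarted after about initialDelaySeconds + failureThreshold × periodSeconds of failing liveness probes, ignoring probe timeouts and the termination grace period. The function name is made up for illustration.

```python
def liveness_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Rough estimate of how long a container has to open its port before
    failed liveness probes trigger a restart (simplified model, see assumptions)."""
    return initial_delay + period * failure_threshold


# Defaults reported by `kubectl describe pod` in the question:
# Liveness: delay=60s period=5s #failure=3
default_budget = liveness_budget_seconds(60, 5, 3)

# Values from the componentSpecs override above
override_budget = liveness_budget_seconds(120, 30, 10)

print(f"default: ~{default_budget}s, override: ~{override_budget}s")
# prints "default: ~75s, override: ~420s"
```

Under this model the default leaves roughly 75 seconds, which is likely too little for a Conda environment that has to download packages such as the 126.9 MB mkl shown in the logs, while the override leaves about 7 minutes.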

Answer 1: (score: 0)

Use the -p flag, as in the example command below, to view the logs of the previously terminated container ruby (example) in the pod web-1 (example):

kubectl logs -p -c ruby web-1

Use the command kubectl get events to check the events.

Use kubectl describe pod podname to check what is causing the crash loop.
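
As a small illustration of what to look for in that output, the Python sketch below decodes the Exit Code 137 and the Started/Finished timestamps listed under Last State in the question. It relies on the common convention that an exit code above 128 means the process was terminated by signal number (code - 128). The helper names are invented for this example.

```python
import signal
from email.utils import parsedate_to_datetime


def signal_from_exit_code(exit_code: int) -> str:
    """Map an exit code above 128 to a signal name (128 + N convention)."""
    if exit_code <= 128:
        return "not a signal exit"
    number = exit_code - 128
    try:
        return signal.Signals(number).name
    except ValueError:  # signal number not defined on this platform
        return f"signal {number}"


def lifetime_seconds(started: str, finished: str) -> float:
    """Seconds between two timestamps in the format kubectl describe prints."""
    delta = parsedate_to_datetime(finished) - parsedate_to_datetime(started)
    return delta.total_seconds()


# Values copied from the "Last State" section of the pod description
killed_by = signal_from_exit_code(137)
ran_for = lifetime_seconds(
    "Thu, 25 Jun 2020 10:19:09 +0200",
    "Thu, 25 Jun 2020 10:20:41 +0200",
)
print(killed_by, ran_for)
```

On Linux this reports SIGKILL and a lifetime of 92 seconds. Together with Reason: Error (rather than OOMKilled), that is consistent with the container being killed from outside after failing its liveness probe rather than the model server crashing on its own, although the output alone does not prove it.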