I am trying to run an MLproject on Kubernetes, following https://github.com/mlflow/mlflow/tree/master/examples/docker, with the command below:
MLFLOW_TRACKING_URI=https://xxxx.xxxx.com MLFLOW_TRACKING_USERNAME=xxxxx MLFLOW_TRACKING_PASSWORD=xxxxxxxx mlflow run -P alpha=0.5 . --backend kubernetes --backend-config kubernetes_config.json --experiment-name test1
The output of kubectl describe pod shows only MLFLOW_TRACKING_URI; the other two environment variables are not picked up. See below:
Name: tutorial-2020-09-22-11-17-54-381954-mgptt
Namespace: mlflow
Priority: 0
Node: kind-worker/172.19.0.3
Containers:
  tutorial:
    Container ID: containerd://7de7124f96ea83da474e66bc7b2119cba4d4e0ab7188e135e3ed190dde5c8df4
    State: Terminated
      Reason: Error
      Exit Code: 1
      Started: Tue, 22 Sep 2020 11:17:55 +0200
      Finished: Tue, 22 Sep 2020 11:17:56 +0200
    Ready: False
    Restart Count: 0
    ...
    Environment:
      MLFLOW_RUN_ID: 1a83aa6a16704883a775ad50bc94c7e9
      MLFLOW_TRACKING_URI: https://xxxx.xxxxx.com
      MLFLOW_EXPERIMENT_ID: 73
...
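The environment variables are declared in the MLproject file shown below. Note that the volume listed there is also not mounted as expected.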
MLproject
docker_env:
  image: somaupday/mlflow-sklearn:latest
  environment: [["MLFLOW_TRACKING_URI", "https://xxxx.xxxxx.com"], ["MLFLOW_TRACKING_USERNAME", "mlflow"], ["MLFLOW_TRACKING_PASSWORD", "xxxxxxx"]]
  volumes: ["${HOME}/.aws:/root/.aws"]
Other relevant files, for reference:
kubernetes_config.json
{
  "kube-context": "kind-kind",
  "repository-uri": "xxxxx/mlflow-sklearn",
  "kube-job-template-path": "./kubernetes_job_template.yaml"
}
kubernetes_job_template.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: "{replaced with MLflow Project name}"
  namespace: mlflow
spec:
  ttlSecondsAfterFinished: 100
  backoffLimit: 0
  template:
    spec:
      containers:
      - name: "{replaced with MLflow Project name}"
        image: "{replaced with URI of Docker image created during Project execution}"
        command: ["{replaced with MLflow Project entry point command}"]
        resources:
          limits:
            memory: 512Mi
          requests:
            memory: 256Mi
      restartPolicy: Never
When I added env: ["{appended with MLFLOW_TRACKING_URI, MLFLOW_RUN_ID and MLFLOW_EXPERIMENT_ID}"] to kubernetes_job_template.yaml, I also ran into a problem: the placeholder was not replaced with actual values, so an invalid Kubernetes Job manifest was created.
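
A minimal Python sketch of why that placeholder could yield an invalid manifest. It assumes the backend appends name/value entries to whatever env list the template already contains, which is a guess based on the "appended with" wording and not confirmed MLflow behavior. The helpers append_env and invalid_env_entries, the trimmed JOB_TEMPLATE dict, and the Secret name mlflow-credentials are all hypothetical and exist only for illustration.

```python
import copy

# Hypothetical, trimmed stand-in for the parsed kubernetes_job_template.yaml.
JOB_TEMPLATE = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "placeholder", "namespace": "mlflow"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "placeholder", "image": "placeholder"}],
                "restartPolicy": "Never",
            }
        }
    },
}


def with_env(job_template, env):
    """Return a copy of the template whose first container has the given env list."""
    job = copy.deepcopy(job_template)
    job["spec"]["template"]["spec"]["containers"][0]["env"] = env
    return job


def append_env(job_template, extra_env):
    """Return a copy with extra_env appended as name/value entries (assumed behavior)."""
    job = copy.deepcopy(job_template)
    container = job["spec"]["template"]["spec"]["containers"][0]
    container.setdefault("env", [])
    container["env"] += [{"name": k, "value": v} for k, v in extra_env.items()]
    return job


def invalid_env_entries(job):
    """List env entries that are not mappings with a 'name' key.

    Kubernetes expects each env entry to be an object with at least 'name'.
    """
    env = job["spec"]["template"]["spec"]["containers"][0].get("env", [])
    return [e for e in env if not (isinstance(e, dict) and "name" in e)]


# Values the tracking client would contribute (taken from the pod description).
tracking_env = {
    "MLFLOW_RUN_ID": "1a83aa6a16704883a775ad50bc94c7e9",
    "MLFLOW_TRACKING_URI": "https://xxxx.xxxxx.com",
    "MLFLOW_EXPERIMENT_ID": "73",
}

# Case 1: the literal placeholder string stays in the list as a bare string.
placeholder = "{appended with MLFLOW_TRACKING_URI, MLFLOW_RUN_ID and MLFLOW_EXPERIMENT_ID}"
broken_job = append_env(with_env(JOB_TEMPLATE, [placeholder]), tracking_env)

# Case 2: credentials declared as proper entries that read from a Secret
# (the Secret name "mlflow-credentials" and its keys are hypothetical).
secret_env = [
    {
        "name": "MLFLOW_TRACKING_USERNAME",
        "valueFrom": {"secretKeyRef": {"name": "mlflow-credentials", "key": "username"}},
    },
    {
        "name": "MLFLOW_TRACKING_PASSWORD",
        "valueFrom": {"secretKeyRef": {"name": "mlflow-credentials", "key": "password"}},
    },
]
fixed_job = append_env(with_env(JOB_TEMPLATE, secret_env), tracking_env)

print("invalid entries with placeholder:", invalid_env_entries(broken_job))
print("invalid entries with Secret refs:", invalid_env_entries(fixed_job))
```

Under that assumption, a bare string in env survives into the final manifest next to the appended entries, while well-formed entries (for example ones using valueFrom.secretKeyRef) sit alongside them without conflict. Whether this matches what MLflow actually does with the template would need to be checked against its source.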