OCI runtime create failed (container_linux.go:349: starting container process) on SageMaker

Asked: 2020-10-06 10:17:07

Tags: docker tensorflow amazon-sagemaker

I am trying to run a model (a Python script) in script mode on AWS SageMaker. I call the script from a notebook using the TensorFlow estimator, as shown below:

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
                         entry_point='train.py', 
                         role=role,
                         train_instance_count=1,
                         train_instance_type='local_gpu',
                         framework_version='1.12',
                         py_version='py3',
                         script_mode=True,
                         hyperparameters={'epochs': 10})

tf_estimator.fit({'training': training_path_input, 'validation': validation_path_input})

I get the following error:

>     Creating tmpvq65nmup_algo-1-wipol_1 ... 
>     Creating tmpvq65nmup_algo-1-wipol_1 ... error
>     ERROR: for tmpvq65nmup_algo-1-wipol_1  Cannot start service algo-1-wipol: OCI runtime create failed: container_linux.go:349:
> starting container process caused "process_linux.go:449: container
> init caused \"process_linux.go:432: running prestart hook 1 caused
> \\\"error running hook: exit status 1, stdout: , stderr:
> nvidia-container-cli: initialization error: nvml error: driver not
> loaded\\\\n\\\"\"": unknown

How can I fix this?

1 answer:

Answer 0 (score: 0)

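Hi, could you share more information about the notebook instance (in particular its instance type) and the kernel the notebook is running?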

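The problem appears to be that the NVIDIA driver is not installed, or at least not loaded, on the host. The last part of the error says so directly: "nvidia-container-cli: initialization error: nvml error: driver not loaded". With train_instance_type='local_gpu', the training container runs on the notebook instance itself and Docker tries to attach a GPU, so this usually means the notebook is on a CPU-only instance type. If that is the case, either switch the notebook to a GPU instance type or use train_instance_type='local' to train on the CPU.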
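One way to check this from the notebook before calling fit() is to see whether nvidia-smi runs successfully on the host, and pick the local-mode instance type accordingly. The following is only a minimal sketch: the helper names gpu_driver_loaded and pick_local_instance_type are hypothetical (they are not part of the SageMaker SDK), and it assumes that a working nvidia-smi is a good enough signal that the driver is loaded.

```python
import shutil
import subprocess


def gpu_driver_loaded() -> bool:
    """Return True if nvidia-smi is on PATH and exits with status 0.

    Assumption: a successful nvidia-smi run means the NVIDIA driver is
    loaded on this host. This is a heuristic, not an official check.
    """
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def pick_local_instance_type() -> str:
    """Choose the SageMaker local-mode instance type for this host."""
    return "local_gpu" if gpu_driver_loaded() else "local"


if __name__ == "__main__":
    # Prints 'local_gpu' on a host with a loaded NVIDIA driver, else 'local'.
    print(pick_local_instance_type())
```

The returned string can then be passed as train_instance_type in the estimator from the question, so the same notebook works on both CPU and GPU instances.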