AWS SageMaker | How to debug a Docker image and see which parameters are passed to it

Time: 2018-11-30 11:22:16

Tags: docker tensorflow amazon-sagemaker

We want to upload a Docker image containing our custom TensorFlow code, and we are currently following this standard example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/tensorflow_bring_your_own/tensorflow_bring_your_own.ipynb

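We were able to upload the Docker image with our dependencies, but we have not been able to pass the S3 location to their method. We are not sure whether the S3 location is passed to the container at all, because the print statements we added do not show up in SageMaker. Can someone help us debug the Docker image? Our custom logs are not available in CloudWatch either.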

2018-11-30 09:55:17 Uploading - Uploading generated training model
2018-11-30 09:55:17 Failed - Training job failed
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-5fc1c1e7ed65> in <module>()
     11                       hyperparameters=hyperparameters)
     12 
---> 13 estimator.fit(data_location)
     14 
     15 # predictor = estimator.deploy(1, instance_type)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
    232         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    233         if wait:
--> 234             self.latest_training_job.wait(logs=logs)
    235 
    236     def _compilation_job_name(self):

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
    571     def wait(self, logs=True):
    572         if logs:
--> 573             self.sagemaker_session.logs_for_job(self.job_name, wait=True)
    574         else:
    575             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll)
   1121 
   1122         if wait:
-> 1123             self._check_job_status(job_name, description, 'TrainingJobStatus')
   1124             if dot:
   1125                 print()

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
    821             reason = desc.get('FailureReason', '(No reason provided)')
    822             job_type = status_key_name.replace('JobStatus', ' job')
--> 823             raise ValueError('Error for {} {}: {} Reason: {}'.format(job_type, job, status, reason))
    824 
    825     def wait_for_endpoint(self, endpoint, poll=5):

ValueError: Error for Training job tensor-2018-11-30-09-52-12-964: Failed Reason: AlgorithmError: Exception during training: Return Code: 1, CMD: ['/usr/bin/python', 'cifar10.py', '--model-dir', '/opt/ml/model', '--train-steps', '100'], Err: b'/usr/local/lib/python3.5/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n  from ._conv import register_converters as _register_converters\nTraceback (most recent call last):\n  File "cifar10.py", line 195, in <module>\n    main()\n  File "cifar10.py", line 188, in main\n    interactions_processed, user_meta_processed, item_meta_processed, item_feats_set = process_data(interaction_data, interaction_cols, users_meta, users_meta_cols, items_meta, items_meta_cols, user_meta_filterlist=user_meta_list)\n  File "cifar10.py", line 32, in process_data\n    df=pd.read_csv(interaction_data, engine=\'c\', encoding=\'latin1\', usecols=interaction_cols).astype(str)\n  File "/usr/local/lib/python3.5/d

1 Answer:

Answer 0 (score: 0)

Looking at your error message, this appears to be a problem that is not present in the original example you are following.

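Diagnosing and debugging this will probably require more details about your cifar10.py file, because the stack trace you posted does not seem to match the original cifar10.py file: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/tensorflow_bring_your_own/container/cifar10/cifar10.py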
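The FailureReason in your traceback is cut off before the actual exception, which makes the root cause hard to see. As a rough sketch (not something taken from the example repository), the script could record its own full traceback. The helper name `run_and_report` is hypothetical. `/opt/ml/output/failure` is the documented place where SageMaker looks for a failure description, but if your container's entry point writes that file too, its message may replace this one.

```python
import sys
import traceback
from pathlib import Path

# Documented SageMaker location for a failure description.
# It is a parameter below so the helper can be tried outside a container.
FAILURE_FILE = "/opt/ml/output/failure"


def run_and_report(main, failure_file=FAILURE_FILE):
    """Run main(); on error, record the full traceback and return exit code 1."""
    try:
        main()
        return 0
    except Exception:
        trace = traceback.format_exc()
        # flush=True so output buffering cannot swallow the message.
        print(trace, file=sys.stdout, flush=True)
        try:
            path = Path(failure_file)
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(trace)
        except OSError:
            pass  # the path may not be writable outside a training container
        return 1
```

At the bottom of your cifar10.py you would then call `sys.exit(run_and_report(main))` instead of calling `main()` directly.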

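Also, I know iteration can be very slow, so I recommend using local mode to speed things up before moving to production on SageMaker. The example notebook referenced above demonstrates this: use "local" as the value of train_instance_type (for training) or instance_type (for hosting).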
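A minimal sketch of what that could look like, assuming the SDK version visible in your traceback (the v1 argument names `image_name`, `train_instance_count`, `train_instance_type`; later SDK versions rename these). The helper names `estimator_kwargs` and `train` are mine, not part of the SDK, and local mode also needs Docker on the notebook instance.

```python
def estimator_kwargs(image_name, role, instance_type="local", hyperparameters=None):
    """Constructor arguments for sagemaker.estimator.Estimator (SDK v1 names)."""
    return {
        "image_name": image_name,
        "role": role,
        "train_instance_count": 1,
        # "local" runs the container on this machine instead of a training instance.
        "train_instance_type": instance_type,
        "hyperparameters": dict(hyperparameters or {}),
    }


def train(kwargs, data_location):
    """Start a training job and block until it finishes."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from sagemaker.estimator import Estimator

    estimator = Estimator(**kwargs)
    estimator.fit(data_location)
    return estimator
```

For example, `train(estimator_kwargs(image_name, role), "file:///path/to/data")` would run the container locally against a local directory; the path here is only a placeholder. Passing your S3 location and a real instance type runs the same image as a normal training job.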

Does your example work with a dataset provided from a local directory (file:///)?

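If it does, but it fails on SageMaker, your code is probably not looking for the dataset in the directory where SageMaker puts it. The S3 location itself is not handed to your script; SageMaker copies your data into the container under specific channels, as described here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-running-container-trainingdata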
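To check what actually arrives in the container, something like the following could be called at the start of the training script. It is only a sketch: `describe_inputs` is a hypothetical helper, and it relies on the documented layout under `/opt/ml` (`input/config/hyperparameters.json`, `input/config/inputdataconfig.json`, and one directory per channel under `input/data/`).

```python
import json
from pathlib import Path


def describe_inputs(prefix="/opt/ml"):
    """Return hyperparameters, channel config, and the files found per channel."""
    base = Path(prefix)
    config_dir = base / "input" / "config"
    data_dir = base / "input" / "data"

    def load(name):
        # Missing config files are reported as None rather than raising.
        path = config_dir / name
        return json.loads(path.read_text()) if path.is_file() else None

    channels = {}
    if data_dir.is_dir():
        for channel in sorted(p for p in data_dir.iterdir() if p.is_dir()):
            channels[channel.name] = sorted(
                str(f.relative_to(channel)) for f in channel.rglob("*") if f.is_file()
            )
    return {
        "hyperparameters": load("hyperparameters.json"),
        "inputdataconfig": load("inputdataconfig.json"),
        "channels": channels,
    }
```

Calling `print(json.dumps(describe_inputs(), indent=2), flush=True)` before `pd.read_csv` should show whether your CSV files were copied in and under which channel name. As far as I know, a plain string passed to `fit` ends up in a channel named `training`, but the printed output will confirm that either way.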

Please let me know if I can clarify anything.