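I have successfully started training jobs on Google Cloud. However, after running for 30 minutes to an hour and completing a few thousand steps, they terminate with an uninformative error message: "CancelledError: Cancelled".
I am training on about 30K images spread across 16 tfrecord files. I did not have this problem when training on a smaller number of images (about 5K) in a single file.
Here are the details. I start the job with the following command: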
gcloud ai-platform jobs submit training my_job_name \
--runtime-version 1.13 \
--job-dir=gs://image-training/my_job_dir \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,dist/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--region us-east1 --config object_detection/CLOUDgpu.yaml \
--python-version 3.5 \
-- \
--model_dir gs://image-training/my_job_dir \
--pipeline_config_path=gs://image-training/ssd_inception_v2_coco_2018_01_28/ssd_inception_v2_CLOUD.config
Here is my YAML file:
trainingInput:
  runtimeVersion: "1.13"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
My pipeline config file references the data files like this:
train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://image-training/t0423data/train_*_re.tfrecord"
  }
  num_readers: 3
  label_map_path: "gs://image-training/PigCount/label_map.pbtxt"
}
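To double-check which shard files the wildcard in input_path picks up, the pattern can be tried against a file listing locally. The following is a minimal sketch using Python's standard fnmatch module; the shard file names are hypothetical (the actual names in the bucket are not shown here), and the helper matching_shards is an illustration that only approximates the matching TensorFlow performs on gs:// paths.

```python
import fnmatch

# Pattern copied from the pipeline config above.
PATTERN = "gs://image-training/t0423data/train_*_re.tfrecord"


def matching_shards(paths, pattern=PATTERN):
    """Return the paths that match the shell-style wildcard pattern, sorted."""
    return sorted(p for p in paths if fnmatch.fnmatchcase(p, pattern))


# Hypothetical bucket listing: 16 training shards plus one unrelated file.
candidates = [
    "gs://image-training/t0423data/train_%02d_re.tfrecord" % i for i in range(16)
]
candidates.append("gs://image-training/t0423data/val_00_re.tfrecord")

shards = matching_shards(candidates)
print(len(shards), "files match the pattern")
```

If the number of matches differs from the number of shards you expect, the wildcard is worth revisiting before looking elsewhere.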
Finally, the full error:
The replica worker 6 exited with a non-zero status of 1. Termination reason:
Error. Traceback (most recent call last): [...] saving_listeners) File
"/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py",
line 1407, in _train_with_estimator_spec _, loss =
mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 676, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1171, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1270, in run raise six.reraise(*original_exc_info) File
"/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise raise
value File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1255, in run return self._sess.run(*args, **kwargs) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1327, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1091, in run return self._sess.run(*args, **kwargs) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 929, in run run_metadata_ptr) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1152, in _run feed_dict_tensor, options, run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1328, in _do_run run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1348, in _do_call raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Cancelled To find out
more about why your job exited please check the logs:
https://console.cloud.google.com/logs/viewer?project=226138759195&resource=ml_job%2Fjob_id%2Ft_05_01_big_data1&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22t_05_01_big_data1%22
The logs for replica 6 show the following error:
command '['python3', '-m', 'object_detection.model_main', '--model_dir', 'gs://image-training/my_job_dir', '--pipeline_config_path=gs://image-training/ssd_inception_v2_coco_2018_01_28/ssd_inception_v2_CLOUD.config', '--job-dir', 'gs://image-training/my_job_dir']' returned non-zero exit status 1
And before that:
worker-replica-6
Traceback (most recent call last): File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call return fn(*args) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.CancelledError: Cancelled During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main "__main__", mod_spec) File "/usr/lib/python3.5/runpy.py", line 85, in _run_code exec(code, run_globals) File "/root/.local/lib/python3.5/site-packages/object_detection/model_main.py", line 109, in <module> tf.app.run() File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "/root/.local/lib/python3.5/site-packages/object_detection/model_main.py", line 105, in main tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0]) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate return executor.run() File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 638, in run getattr(self, task_to_run)() File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 648, in run_worker return self._start_distributed_training() File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 789, in _start_distributed_training saving_listeners=saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train loss = self._train_model(input_fn, hooks, 
saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1171, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run raise six.reraise(*original_exc_info) File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise raise value File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1327, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1091, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run run_metadata_ptr) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", 
line 1348, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.CancelledError: Cancelled
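Because the log entry above arrives as flattened text, it can be easier to read after splitting it back into individual frames. This is a minimal sketch, assuming the trace has been copied into a string; the helper name split_frames is hypothetical, and the regular expression only targets the 'File "...", line N, in name' shape seen in this trace.

```python
import re

# Matches: File "<path>", line <number>, in <function>
# \s+ tolerates the line wraps that appear in the copied log text.
FRAME_RE = re.compile(r'File\s+"([^"]+)",\s+line\s+(\d+),\s+in\s+(\S+)')


def split_frames(flat_trace):
    """Return (path, line_number, function) tuples found in a flattened traceback."""
    return [
        (path, int(line), func)
        for path, line, func in FRAME_RE.findall(flat_trace)
    ]


# Short excerpt of the worker-replica-6 trace shown above.
sample = (
    'Traceback (most recent call last): File '
    '"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", '
    'line 1334, in _do_call return fn(*args) File '
    '"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", '
    'line 1319, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) '
    'tensorflow.python.framework.errors_impl.CancelledError: Cancelled'
)

frames = split_frames(sample)
for path, line, func in frames:
    # Print only the file name, not the full site-packages path.
    print("%s:%d in %s" % (path.rsplit("/", 1)[-1], line, func))
```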
Any ideas on how I can prevent these jobs from failing?
Answer 0 (score: 0)
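I seem to have solved this by increasing the number and power of the machines used. I changed my YAML file to the following, and the job ran for 50,000 steps without any problems. It is far more expensive, but at least it works: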
trainingInput:
  scaleTier: CUSTOM
  # Configure a master worker with 4 K80 GPUs
  masterType: n1-highcpu-16
  masterConfig:
    acceleratorConfig:
      count: 4
      type: NVIDIA_TESLA_K80
  # Configure 9 workers, each with 4 K80 GPUs
  workerCount: 9
  workerType: n1-highcpu-16
  workerConfig:
    acceleratorConfig:
      count: 4
      type: NVIDIA_TESLA_K80
  # Configure 3 parameter servers with no GPUs
  parameterServerCount: 3
  parameterServerType: n1-highmem-8
For a full description, see this page: https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus
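To put the cost difference in perspective, the GPU totals of the two configurations can be tallied. This is a rough sketch, assuming the legacy standard_gpu machine type provides a single K80 per machine; the per-machine counts for the new configuration come from the YAML above, and the function name total_gpus is hypothetical.

```python
def total_gpus(master_gpus, worker_count, gpus_per_worker):
    """Total GPUs across the master and all workers (parameter servers have none)."""
    return master_gpus + worker_count * gpus_per_worker


# Original config: a standard_gpu master plus 9 standard_gpu workers,
# assuming one K80 per standard_gpu machine.
original = total_gpus(master_gpus=1, worker_count=9, gpus_per_worker=1)

# New config: 4 K80s on the master and on each of the 9 workers.
upgraded = total_gpus(master_gpus=4, worker_count=9, gpus_per_worker=4)

print("original:", original, "GPUs; upgraded:", upgraded, "GPUs")
```

Under that assumption the new cluster uses four times as many GPUs as the original one, which fits the observation that it costs considerably more.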