Hello TensorFlow and Google Cloud users/developers,
When I submit a job that requires GPU support, ml-engine fails while loading the libnccl.so.2 file. Here is the output from the gcloud logs:
INFO 2019-01-07 15:13:58 +0000 master-replica-0 Error reported to Coordinator: libnccl.so.2: cannot open shared object file: No such file or directory
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 Traceback (most recent call last):
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 "__main__", mod_spec)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 exec(code, run_globals)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/root/.local/lib/python3.5/site-packages/main/task.py", line 220, in <module>
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 main()
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/root/.local/lib/python3.5/site-packages/main/task.py", line 185, in main
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 valid_spec
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return executor.run()
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 637, in run
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 getattr(self, task_to_run)()
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 674, in run_master
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 self._start_distributed_training(saving_listeners=saving_listeners)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 saving_listeners=saving_listeners)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 loss = self._train_model(input_fn, hooks, saving_listeners)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1205, in _train_model
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return self._train_model_distributed(input_fn, hooks, saving_listeners)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1316, in _train_model_distributed
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 self.config)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/distribute.py", line 721, in call_for_each_tower
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return self._call_for_each_tower(fn, *args, **kwargs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 556, in _call_for_each_tower
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return _call_for_each_tower(self, fn, *args, **kwargs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 183, in _call_for_each_tower
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 coord.join(threads)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 six.reraise(*self._exc_info_to_raise)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 raise value
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 yield
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 **merge_kwargs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 variable_scope.VariableAggregation.SUM, grads_and_vars)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return self._batch_reduce(aggregation, value_destination_pairs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 value_destination_pairs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return self._batch_reduce(aggregation, value_destination_pairs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 597, in _batch_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 [v[0] for v in value_destination_pairs])
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 631, in _batch_all_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 device_grad_packs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_utils.py", line 41, in aggregate_gradients_using_nccl
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 agg_grads = nccl.all_sum(single_grads)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 49, in all_sum
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return _apply_all_reduce('sum', tensors)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 217, in _apply_all_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 _validate_and_load_nccl_so()
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 288, in _validate_and_load_nccl_so
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 _maybe_load_nccl_ops_so()
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 274, in _maybe_load_nccl_ops_so
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 resource_loader.get_path_to_datafile('_nccl_ops.so'))
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/util/loader.py", line 56, in load_op_library
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 ret = load_library.load_op_library(path)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/load_library.py", line 60, in load_op_library
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 lib_handle = py_tf.TF_LoadLibrary(library_filename)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 tensorflow.python.framework.errors_impl.NotFoundError: libnccl.so.2: cannot open shared object file: No such file or directory
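To narrow this down, one option would be to check whether the dynamic loader on the worker can find the library at all before training starts. The following is only a minimal sketch: the helper name can_load_shared_library is my own and not part of TensorFlow or ml-engine, and it assumes the check runs in the same environment as task.py.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the file is missing or one of its own dependencies
        # cannot be resolved by the loader.
        return False
    return True


if __name__ == "__main__":
    # Hypothetical usage: log the result early in the training entry point.
    print("libnccl.so.2 loadable:", can_load_shared_library("libnccl.so.2"))
```

If this reports False on the master replica, the library is missing from the image or not on the loader's search path, which would match the NotFoundError above.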
Should I install NCCL on ml-engine myself? I specify "tensorflow-gpu(>=1.12)" in the required_packages field of setup.py. My config.yaml file looks like this:
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
  workerType: complex_model_m_gpu
  parameterServerType: large_model
  workerCount: 0
  parameterServerCount: 0
My quota allows me to use 4 K80 devices in the europe-west1 region.
Thank you very much for your help.