Google Cloud ml-engine cannot load libnccl

Time: 2019-01-07 15:20:43

Tags: tensorflow google-cloud-platform google-cloud-ml

Hello TensorFlow and Google Cloud users and developers,

When I submit a job that requires GPU support, ml-engine fails while loading libnccl.so.2. Here is the output from the gcloud logs:

INFO    2019-01-07 15:13:58 +0000   master-replica-0        Error reported to Coordinator: libnccl.so.2: cannot open shared object file: No such file or directory
ERROR   2019-01-07 15:13:58 +0000   master-replica-0        Traceback (most recent call last):
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            "__main__", mod_spec)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            exec(code, run_globals)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/root/.local/lib/python3.5/site-packages/main/task.py", line 220, in <module>
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            main()
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/root/.local/lib/python3.5/site-packages/main/task.py", line 185, in main
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            valid_spec
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            return executor.run()
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 637, in run
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            getattr(self, task_to_run)()
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 674, in run_master
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            self._start_distributed_training(saving_listeners=saving_listeners)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            saving_listeners=saving_listeners)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            loss = self._train_model(input_fn, hooks, saving_listeners)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1205, in _train_model
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            return self._train_model_distributed(input_fn, hooks, saving_listeners)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1316, in _train_model_distributed
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            self.config)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/distribute.py", line 721, in call_for_each_tower
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            return self._call_for_each_tower(fn, *args, **kwargs)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 556, in _call_for_each_tower
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            return _call_for_each_tower(self, fn, *args, **kwargs)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 183, in _call_for_each_tower
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            coord.join(threads)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            six.reraise(*self._exc_info_to_raise)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            raise value
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            yield
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            **merge_kwargs)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            variable_scope.VariableAggregation.SUM, grads_and_vars)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            return self._batch_reduce(aggregation, value_destination_pairs)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            value_destination_pairs)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            return self._batch_reduce(aggregation, value_destination_pairs)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 597, in _batch_reduce
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            [v[0] for v in value_destination_pairs])
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 631, in _batch_all_reduce
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            device_grad_packs)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_utils.py", line 41, in aggregate_gradients_using_nccl
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            agg_grads = nccl.all_sum(single_grads)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 49, in all_sum
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            return _apply_all_reduce('sum', tensors)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 217, in _apply_all_reduce
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            _validate_and_load_nccl_so()
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 288, in _validate_and_load_nccl_so
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            _maybe_load_nccl_ops_so()
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 274, in _maybe_load_nccl_ops_so
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            resource_loader.get_path_to_datafile('_nccl_ops.so'))
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/util/loader.py", line 56, in load_op_library
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            ret = load_library.load_op_library(path)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0          File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/load_library.py", line 60, in load_op_library
ERROR   2019-01-07 15:13:58 +0000   master-replica-0            lib_handle = py_tf.TF_LoadLibrary(library_filename)
ERROR   2019-01-07 15:13:58 +0000   master-replica-0        tensorflow.python.framework.errors_impl.NotFoundError: libnccl.so.2: cannot open shared object file: No such file or directory
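
The traceback ends with the dynamic loader failing to open libnccl.so.2 when TensorFlow loads _nccl_ops.so. As a minimal diagnostic sketch (not part of my job; the helper name `can_load_shared_library` is made up for illustration), something like the following could be run at the start of the training task to check whether the loader can find the library on the worker at all:

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        # CDLL asks the system loader to open the library, searching the
        # same default locations the loader would use for a dependency.
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be opened.
        return False
    return True


if __name__ == "__main__":
    # libnccl.so.2 is the file named in the error message above.
    print("libnccl.so.2 loadable:", can_load_shared_library("libnccl.so.2"))
```

If this prints False on the GPU worker, the library is presumably missing from the image or not on the loader's search path, rather than the problem being specific to TensorFlow's op loading.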

Do I need to install NCCL on ml-engine myself? I specify "tensorflow-gpu (>= 1.12)" in the required_packages field of setup.py. My config.yaml file looks like this:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
  workerType: complex_model_m_gpu
  parameterServerType: large_model
  workerCount: 0
  parameterServerCount: 0

My quota allows me to use 4 K80 devices in the europe-west1 region.

Thank you very much for your help.

0 answers