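I am running a GCMLE experiment and trying to use MirroredStrategy for distributed GPU training. The code runs fine without distributed GPUs. To make the change, I adjusted my run_config to accept train_distribute=tf.contrib.distribute.MirroredStrategy(num_gpus=4), and my config file uses a complex_model_m_p100 machine, which should provide 4 GPUs.

I first get the warning Error reported to Coordinator: libnccl.so.2: cannot open shared object file: No such file or directory, and the job eventually fails with NotFoundError: libnccl.so.2: cannot open shared object file: No such file or directory (see the full stack trace below).

At first glance this looks like an internal error: the required library does not appear to be installed on the machines I am trying to use. A responder on this GitHub issue seems to suggest that "NCCL2" needs to be installed. Is there anything I can do to resolve this error, or is it a GCMLE backend problem outside my control?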
Stacktrace:
The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
[...]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 368, in _batch_reduce
value_destination_pairs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 182, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 524, in _batch_reduce
[v[0] for v in value_destination_pairs])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 556, in _batch_all_reduce
device_grad_packs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/cross_tower_utils.py", line 38, in aggregate_gradients_using_nccl
agg_grads = nccl.all_sum(single_grads)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 49, in all_sum
return _apply_all_reduce('sum', tensors)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 217, in _apply_all_reduce
_validate_and_load_nccl_so()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 288, in _validate_and_load_nccl_so
_maybe_load_nccl_ops_so()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 274, in _maybe_load_nccl_ops_so
resource_loader.get_path_to_datafile('_nccl_ops.so'))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/util/loader.py", line 56, in load_op_library
ret = load_library.load_op_library(path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
NotFoundError: libnccl.so.2: cannot open shared object file: No such file or directory
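
To confirm that the shared library itself is missing on the worker, rather than something going wrong inside TensorFlow, a check like the one below could be run at the start of the trainer. This is only a minimal sketch: the helper name nccl_available is hypothetical, it uses only the standard-library ctypes module, and it assumes the process sees the same dynamic loader search path that TensorFlow uses when it loads _nccl_ops.so.

```python
import ctypes


def nccl_available(lib_name="libnccl.so.2"):
    """Return True if the dynamic loader can open lib_name, else False.

    Hypothetical helper for illustration; it only tests whether the shared
    object can be loaded, not whether the NCCL version is compatible.
    """
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        # Raised when the loader cannot find or open the shared object,
        # which is the same condition behind the NotFoundError above.
        return False
    return True


if __name__ == "__main__":
    if nccl_available():
        print("libnccl.so.2 can be loaded on this machine")
    else:
        print("libnccl.so.2 cannot be loaded on this machine")
```

If this reports that the library cannot be loaded, the problem is in the machine image rather than in the training code.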