无法使用V100 GPU运行分布式TensorFlow

时间:2019-02-18 09:06:18

标签: python tensorflow nvidia

无法使用GPU运行TensorFlow。代码在CPU中工作。

Debian 9.8版

  • 1个GPU Nvidia Tesla V100
  • TensorFlow-GPU 1.12
  • Nvidia驱动程序: NVIDIA-Linux-x86_64-390.46.run
  • CUDA: cuda_9.0.176_384.81_linux-run
  • CuDNN: cudnn-9.0-linux-x64-v7.4.1.5.tgz
  • NCCL: nccl_2.3.7-1 + cuda9.0_x86_64.txz

更新: 使用CuDNN 7.1.4进行了测试,并且存在相同的问题

补丁

  • cuda_9.0.176.1_linux-run
  • cuda_9.0.176.2_linux-run
  • cuda_9.0.176.3_linux-run
  • cuda_9.0.176.4_linux-run

错误:

et convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node conv1/Conv2D (defined at mnist_distributed.py:119)  = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
     [[{{node adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1_S43}} = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-1302637405089825922, tensor_name="edge_273_adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'conv1/Conv2D', defined at:
  File "mnist_distributed.py", line 237, in <module>
    tf.app.run()
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "mnist_distributed.py", line 196, in main
    features, labels, keep_prob, global_step, train_step, accuracy, merged = create_model()
  File "mnist_distributed.py", line 149, in create_model
    y_conv, keep_prob = deepnn(x)
  File "mnist_distributed.py", line 77, in deepnn
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
  File "mnist_distributed.py", line 119, in conv2d
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 957, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node conv1/Conv2D (defined at mnist_distributed.py:119)  = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
     [[{{node adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1_S43}} = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-1302637405089825922, tensor_name="edge_273_adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]

代码here

图书馆:

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda

版本

CUDA

cat /usr/local/cuda/version.txt
CUDA Version 9.0.176
CUDA Patch Version 9.0.176.1
CUDA Patch Version 9.0.176.2
CUDA Patch Version 9.0.176.3
CUDA Patch Version 9.0.176.4

CuDNN

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 4
#define CUDNN_PATCHLEVEL 1
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"

类似:

https://github.com/tensorflow/tensorflow/issues/24828

Which TensorFlow and CUDA version combinations are compatible?

1 个答案:

答案 0 :(得分:0)

通过详细查看日志,我遇到了OOM错误,然后我在tf.train.Server中更改了以下内容以使其正常工作:

    @Entity
    @Table(name="table_1")
    public class Table1 {

    @Id
    @Column(name="booking_id")
    private String bookingId;

    @OneToOne(fetch = FetchType.EAGER)
    @JoinColumn(name="booking_id")
    private Table2 table2Data;
}

错误:

config_proto = tf.ConfigProto(log_device_placement=True)
config_proto.gpu_options.allow_growth = True
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index, config=config_proto)