Question

我只有一个GPU（Titan X Pascal，12 GB VRAM），我想在同一个GPU上并行训练多个模型。

我尝试在单个python程序（称为model.py）中封装我的模型，并在model.py中包含代码以限制VRAM使用（基于this example）。我能够在我的GPU上同时运行3个model.py实例（每个实例占我的VRAM的不到33％）。神奇的是，当我尝试使用4个模型时，我收到了一个错误：

2017-09-10 13:27:43.714908: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] coul d not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR 2017-09-10 13:27:43.714973: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] coul d not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM 2017-09-10 13:27:43.714988: F tensorflow/core/kernels/conv_ops.cc:672] Check failed : stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNon fusedAlgo<T>(), &algorithms) Aborted (core dumped)

我后来发现on the tensorflow Github人们似乎认为每个GPU运行多个tensorflow进程是不安全的。这是真的，并且有解释为什么会这样吗？为什么我能够在同一GPU上运行3个tensorflow进程而不是4？

Answer 1

简而言之：是的，在同一GPU上运行多个procce是安全的（截至2017年5月）。这样做以前是不安全的。

Link to tensorflow source code that confirms this

Answer 2

答案

取决于视频内存大小，是否允许。

就我而言，我的总视频内存为2GB，而单个实例大约为1.4GB。当我已经在运行the speech recognition training时尝试运行另一个张量流代码时。

2018-08-28 08:52:51.279676: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.2415
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.65GiB
2018-08-28 08:52:51.294948: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-28 08:52:55.643813: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 08:52:55.647912: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971]      0
2018-08-28 08:52:55.651054: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0:   N
2018-08-28 08:52:55.656853: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1409 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute
capability: 5.0)

我在语音识别中遇到以下错误，该错误完全终止了脚本：（我认为根据to this，这与视频内存不足有关）

2018-08-28 08:53:05.154711: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_driver.cc:1108] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED ::
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1278, in _do_call
    return fn(*args)
  File "C:\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

在同一GPU上运行多个tensorflow进程是不安全的吗？

2 个答案: