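I have only one GPU (Titan X Pascal, 12 GB VRAM), and I would like to train multiple models in parallel on that same GPU.
I tried encapsulating my model in a single Python program (called model.py) and included code in model.py to restrict VRAM usage (based on this example). I was able to run 3 instances of model.py concurrently on my GPU, with each instance taking a little less than 33% of my VRAM. Mysteriously, when I tried with 4 models, I received an error: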
2017-09-10 13:27:43.714908: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-09-10 13:27:43.714973: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-09-10 13:27:43.714988: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
Aborted (core dumped)
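I later found on the tensorflow Github that people seem to think it is unsafe to run more than one tensorflow process per GPU. Is this true, and is there an explanation for why this is the case? Why was I able to run 3 tensorflow processes on the same GPU but not 4?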
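For reference, a per-process VRAM cap like the one described in the question is usually written with the TensorFlow 1.x session options. The sketch below is an assumption about what model.py might contain, not the asker's actual code; the helper names `max_instances` and `make_session_config` are hypothetical. It also shows the arithmetic behind the 3-versus-4 observation: if every process reserves roughly a third of the card, a fourth process has nothing left to reserve, which is one plausible reason the cuDNN handle could not be created.

```python
import math


def max_instances(fraction_per_process):
    """Number of processes that fit if each reserves this fraction of VRAM."""
    # Small epsilon guards against floating-point results such as 3.9999999.
    return math.floor(1.0 / fraction_per_process + 1e-9)


def make_session_config(fraction_per_process):
    """Build a TF 1.x session config that caps this process's share of VRAM.

    Assumes TensorFlow 1.x; on TensorFlow 2.x the same classes live under
    tf.compat.v1. The import is local so the helper above works without it.
    """
    import tensorflow as tf

    gpu_options = tf.GPUOptions(
        per_process_gpu_memory_fraction=fraction_per_process)
    return tf.ConfigProto(gpu_options=gpu_options)


if __name__ == "__main__":
    print(max_instances(0.33))  # 3: a fourth process would exceed 100%
    print(max_instances(0.25))  # 4: lowering the cap leaves room for one more
```

Inside model.py the config would then be passed as `tf.Session(config=make_session_config(0.25))`. Whether four models actually train at 25% each depends on how much memory each model really needs, which this sketch does not measure.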
Answer 0 (score: 4)
In short: yes, it is safe to run multiple processes on the same GPU (as of May 2017). It was previously unsafe to do so.
Answer 1 (score: -1)
Whether this works depends on how much video memory you have.
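In my case, I have 2 GB of video memory in total, and a single instance takes about 1.4 GB. I tried to run another TensorFlow program while the speech recognition training was already running: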
2018-08-28 08:52:51.279676: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.2415
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.65GiB
2018-08-28 08:52:51.294948: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-28 08:52:55.643813: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 08:52:55.647912: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0
2018-08-28 08:52:55.651054: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0: N
2018-08-28 08:52:55.656853: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1409 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
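I then got the following error in the speech recognition training, which terminated the script completely (according to this, I think it is related to running out of video memory):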
2018-08-28 08:53:05.154711: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_driver.cc:1108] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED ::
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1278, in _do_call
    return fn(*args)
  File "C:\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed
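The memory figures in the log above make the problem easy to check by hand: 1.65 GiB is free and one instance takes about 1.4 GB, so only one instance fits. A minimal sketch of that check follows; the helper names `parse_memory` and `instances_that_fit` are hypothetical, and GB and GiB are treated as interchangeable, which is only an approximation.

```python
import re

# Copied from the gpu_device.cc log output above.
LOG_LINE = "totalMemory: 2.00GiB freeMemory: 1.65GiB"


def parse_memory(line):
    """Extract (total, free) memory in GiB from a TensorFlow device log line."""
    match = re.search(
        r"totalMemory:\s*([\d.]+)GiB\s+freeMemory:\s*([\d.]+)GiB", line)
    if match is None:
        raise ValueError("no memory figures found in: %r" % line)
    return float(match.group(1)), float(match.group(2))


def instances_that_fit(free_gib, per_instance_gib):
    """Whole number of instances that fit into the free memory."""
    return int(free_gib // per_instance_gib)


if __name__ == "__main__":
    total, free = parse_memory(LOG_LINE)
    # About 1.4 GB per instance is the figure reported in this answer.
    print(instances_that_fit(free, 1.4))  # 1
```

With these numbers a second instance would need roughly 2.8 GB in total, more than the 2 GB card provides.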