OOM when allocating GPU memory - TF gpu_options has no effect

Asked: 2018-04-13 17:28:21

Tags: python tensorflow out-of-memory

I am getting warnings like the following:

W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.62MiB. Current allocation summary follows.

Eventually, the program crashes because it cannot find enough memory, as shown below:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,160,400] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I have set the GPU options allow_growth and per_process_gpu_memory_fraction as follows:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.4

estimator = tf.contrib.learn.Estimator(
        model_fn=model_fn,
        model_dir=MODEL_DIR,
        config=tf.contrib.learn.RunConfig(session_config=config))

However, neither option has any effect.
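For comparison, the same options applied to a bare TF 1.x session, outside the Estimator wrapper, would look like the sketch below (a config fragment for isolating the problem, not code from my actual run):

```python
import tensorflow as tf

# Minimal sketch (TF 1.x API): the same GPU options on a bare session,
# to check whether they take effect outside the Estimator wrapper.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.4

with tf.Session(config=config) as sess:
    # Run any trivial op so the GPU device is actually initialized.
    sess.run(tf.constant(1.0))
```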

Here is the log:

INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x2aac9e3585c0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.4
  allow_growth: true
}
INFO:tensorflow:Graph was finalized.
2018-04-12 14:51:32.876271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: Tesla K40m major: 3 minor: 5 memoryClockRate(GHz): 0.745
pciBusID: 0000:86:00.0
totalMemory: 11.17GiB freeMemory: 11.09GiB
2018-04-12 14:51:32.876365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-12 14:51:33.690027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4575 MB memory) -> physical GPU (device: 0, name: Tesla K40m, pci bus id: 0000:86:00.0, compute capability: 3.5)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2018-04-12 15:03:36.187286: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.62MiB.  Current allocation summary follows.
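Incidentally, the "4575 MB" in the device-creation line appears to be exactly 0.4 of the card's 11.17 GiB total memory, which suggests the fraction option is in fact being picked up; a quick arithmetic check (my own observation, not part of the log):

```python
# Check: 4575 MB device cap vs. per_process_gpu_memory_fraction = 0.4
total_memory_mib = 11.17 * 1024   # "totalMemory: 11.17GiB" from the log
fraction = 0.4                    # per_process_gpu_memory_fraction
print(round(total_memory_mib * fraction))  # → 4575, matching the log line
```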

Has anyone run into this problem before?

Note that this happens while the model is being initialized, so the allow_growth and per_process_gpu_memory_fraction options should already have prevented it, but they have no effect.

Any pointers or hints on how to resolve this would be appreciated.

I have looked at similar questions and GitHub issues, but none of them helped:
oom-when-allocating-tensor
Tensorflow doesn't allocate full GPU memory
Setting session from TrainConfig doesn't seem to work

0 Answers:

There are no answers yet.