Google Colab OOM issue

Date: 2018-05-18 19:48:19

Tags: keras, google-colaboratory

I'm building a Keras model to run some simple image recognition tasks. When I do everything in raw Keras, I don't hit OOM. Strangely, though, when I run the same thing through a mini framework I wrote, which is very simple and exists mainly so I can keep track of the hyperparameters and settings I use, I do hit OOM. Most of the execution should be identical to running raw Keras, so I'm guessing I made a mistake somewhere in my code. Note that the same mini framework has no problem running on CPU on my local laptop. I suppose I'll need to debug, but before that, does anyone have any general advice?

Here are the first few lines of the errors I got:

Epoch 1/50
2018-05-18 17:40:27.435366: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-18 17:40:27.435906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235 pciBusID: 0000:00:04.0 totalMemory: 11.17GiB freeMemory: 504.38MiB
2018-05-18 17:40:27.435992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-05-18 17:40:27.784517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-18 17:40:27.784675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-05-18 17:40:27.784724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-05-18 17:40:27.785072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 243 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
2018-05-18 17:40:38.569609: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 36.00MiB.  Current allocation summary follows.
2018-05-18 17:40:38.569702: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256):   Total Chunks: 66, Chunks in use: 66. 16.5KiB allocated for chunks. 16.5KiB in use in bin. 2.3KiB client-requested in use in bin.
2018-05-18 17:40:38.569768: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (512):   Total Chunks: 10, Chunks in use: 10. 5.0KiB allocated for chunks. 5.0KiB in use in bin. 5.0KiB client- etc. etc

2018-05-18 17:40:38.573706: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[18432,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

1 Answer:

Answer 0 (score: 1)

This is caused by running out of GPU memory, as the warning makes clear. Note that the log above shows TensorFlow found only 504.38MiB of the K80's 11.17GiB free at startup, so most of the GPU memory was already held by another process or a previous session.
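As a quick first check (my suggestion, not part of the original answer), you can see what is occupying GPU memory from a Colab cell before training:

# Shell command run from a notebook cell; lists GPU memory usage per process
!nvidia-smi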

The first workaround is, if possible, to allow GPU memory growth by writing this config proto and passing it to tf.Session():
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

Then pass this config to the session that is causing the error, like this:

# Create the session with the memory-growth config applied
sess = tf.Session(config=config)

If that does not help, you can disable the GPU for the particular session that is causing the error, like this:

# Hide the GPU from this session so all ops fall back to CPU
config = tf.ConfigProto(device_count={'GPU': 0})
sess = tf.Session(config=config)
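Alternatively (my addition, not from the original answer), you can hide the GPU from TensorFlow for the whole process by setting the CUDA_VISIBLE_DEVICES environment variable before any session is created:

import os

# Must run before TensorFlow touches the GPU; '-1' hides all GPUs so TF falls back to CPU
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'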

If you are using Keras, you can reach the underlying session through the Keras backend and apply these configs to it.
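For example, here is a minimal sketch of applying the memory-growth config through the Keras backend (assuming the TF 1.x style Keras API this answer was written against):

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True    # allocate GPU memory on demand rather than all upfront
K.set_session(tf.Session(config=config))  # make Keras run its graphs in the configured session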