OOM error when running the ResNet model in TensorFlow

Time: 2016-10-04 22:46:00

Tags: tensorflow

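I am running the resnet model from https://github.com/tensorflow/models/blob/master/resnet/resnet_main.py on an EC2 g2 (NVIDIA GRID K520) instance and I am seeing an OOM error. I have tried various combinations: removing the code that uses the GPU, prefixing the command with CUDA_VISIBLE_DEVICES='0', and reducing batch_size to 64. I still cannot get training to start. Can you help me?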

  

    W tensorflow/core/common_runtime/bfc_allocator.cc:270] **********************x***************************************************************************xx
    W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 196.00MiB.  See logs for memory state.
    W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[64,16,224,224]
    E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[64,16,224,224]
         [[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
         [[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
    Traceback (most recent call last):
      File "./resnet_main.py", line 203, in <module>
        tf.app.run()
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
        sys.exit(main(sys.argv))
      File "./resnet_main.py", line 197, in main
        train(hps)
      File "./resnet_main.py", line 82, in train
        feed_dict={model.lrn_rate: lrn_rate})
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
        run_metadata_ptr)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
        feed_dict_string, options, run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
        target_list, options, run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,16,224,224]
         [[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
         [[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
    Caused by op u'unit_1_2/sub1/conv1/Conv2D', defined at:
      File "./resnet_main.py", line 203, in <module>
        tf.app.run()
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
        sys.exit(main(sys.argv))
      File "./resnet_main.py", line 197, in main
        train(hps)
      File "./resnet_main.py", line 64, in train
        model.build_graph()
      File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 59, in build_graph
        self._build_model()
      File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 94, in _build_model
        x = res_func(x, filters[1], filters[1], self._stride_arr(1), False)
      File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 208, in _residual
        x = self._conv('conv1', x, 3, in_filter, out_filter, stride)
      File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 279, in _conv
        return tf.nn.conv2d(x, kernel, strides, padding='SAME')
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
        data_format=data_format, name=name)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
        op_def=op_def)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
        original_op=self._default_original_op, op_def=op_def)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
        self._traceback = _extract_stack()

1 answer:

Answer 0 (score: 0)

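The NVIDIA GRID K520 has 8 GB of memory (link), but the board carries two GPUs with 4 GB each, so a single GPU on a g2 instance exposes only about 4 GB. I have successfully trained ResNet models on an NVIDIA GPU with 12 GB of memory. As the error shows, TensorFlow tries to fit the network's weights and activations into GPU memory and fails. I believe you have a few options: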

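  • Train on the CPU only, as mentioned in the comments, assuming you have more than 8 GB of CPU memory. This is not recommended, because training will be much slower.
  • Train a different network with fewer parameters. Several networks have been published since ResNet, such as Inception-v4 and Inception-ResNet, that have fewer parameters and comparable accuracy. This option costs nothing!
  • Buy a GPU with more memory. This is the easiest option if you have the money.
  • Buy a second GPU with the same amount of memory and train the lower half of the network on one GPU and the upper half on the other. The difficulty of communicating between GPUs makes this option less attractive.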
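
To put a number on the failure: the tensor named in the error message, shape [64,16,224,224], is the output of a Conv2D op, and the log reports T=DT_FLOAT. The sketch below checks that its size matches the 196.00MiB in the log and shows how the size scales with batch_size. It assumes 4-byte float32 elements, and `tensor_mib` is a hypothetical helper of my own, not part of TensorFlow. Note that this is one tensor among many that are kept alive during training, so it only hints at the total footprint.

```python
# Size of the tensor named in the OOM message, assuming 4-byte float32
# elements (the log reports T=DT_FLOAT).
BYTES_PER_FLOAT32 = 4
MIB = 1024 * 1024


def tensor_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size in MiB of a dense tensor with the given shape.

    Hypothetical helper for illustration; not a TensorFlow function.
    """
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / float(MIB)


# Shape taken from the error message: [batch_size, 16, 224, 224].
for batch_size in (64, 32, 16):
    size = tensor_mib([batch_size, 16, 224, 224])
    print("batch_size=%d -> %.2f MiB" % (batch_size, size))
```

Since the size is linear in the first dimension, halving batch_size halves this tensor, which is why lowering batch_size further is usually the first thing to try before any of the options above.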

I hope this helps you and others who run into similar memory problems.
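
A note on the CPU-only option: the CUDA_VISIBLE_DEVICES='0' prefix mentioned in the question exposes GPU 0 rather than hiding it. A minimal sketch of hiding every GPU from a Python process follows, assuming the variable is set before TensorFlow is imported. `hide_gpus_from_cuda` is a hypothetical helper name, and the TensorFlow import is left commented out so the snippet runs on its own.

```python
import os


def hide_gpus_from_cuda():
    """Set CUDA_VISIBLE_DEVICES to an empty list of devices.

    Hypothetical helper for illustration. It only affects CUDA libraries
    that are loaded after this call.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


hide_gpus_from_cuda()
# import tensorflow as tf  # assumption: import only after the variable is set
print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))
```

The same effect should be available from the shell by prefixing the command with CUDA_VISIBLE_DEVICES='' instead of '0'.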