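I am running the ResNet model from https://github.com/tensorflow/models/blob/master/resnet/resnet_main.py on an EC2 g2 (NVIDIA GRID K520) instance and am seeing an OOM error. I have tried various combinations of removing the code that uses the GPU, prefixing the command with CUDA_VISIBLE_DEVICES='0', and reducing batch_size to 64. I still cannot get training to start. Can you help?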
W tensorflow/core/common_runtime/bfc_allocator.cc:270] **********************x***************************************************************************xx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 196.00MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[64,16,224,224]
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[64,16,224,224]
     [[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
     [[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
  File "./resnet_main.py", line 203, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "./resnet_main.py", line 197, in main
    train(hps)
  File "./resnet_main.py", line 82, in train
    feed_dict={model.lrn_rate: lrn_rate})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,16,224,224]
     [[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
     [[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'unit_1_2/sub1/conv1/Conv2D', defined at:
  File "./resnet_main.py", line 203, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "./resnet_main.py", line 197, in main
    train(hps)
  File "./resnet_main.py", line 64, in train
    model.build_graph()
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 59, in build_graph
    self._build_model()
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 94, in _build_model
    x = res_func(x, filters[1], filters[1], self._stride_arr(1), False)
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 208, in _residual
    x = self._conv('conv1', x, 3, in_filter, out_filter, stride)
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 279, in _conv
    return tf.nn.conv2d(x, kernel, strides, padding='SAME')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
    data_format=data_format, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()
Answer 0 (score: 0)
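The NVIDIA GRID K520 has 8 GB of memory in total (link), split across its two GPUs, so each GPU that TensorFlow sees has roughly 4 GB. I have successfully trained ResNet models on an NVIDIA GPU with 12 GB of memory. As the error shows, TensorFlow tried to fit the network's weights and intermediate tensors into GPU memory and failed. I believe you have a few options: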
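As a rough sanity check on the error message, the size of the tensor it names can be worked out by hand. The sketch below assumes 4 bytes per element (the log reports T=DT_FLOAT) and uses a hypothetical helper name, tensor_mib, that is not part of TensorFlow or the ResNet code. It only illustrates that a single tensor of this shape takes 196 MiB, and that its size scales linearly with the batch size.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Return the size in MiB of a dense tensor with the given shape.

    bytes_per_element defaults to 4, assuming float32 (DT_FLOAT in the log).
    """
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / (1024.0 * 1024.0)


# Shape reported in the OOM message: [batch, 16, 224, 224].
print(tensor_mib([64, 16, 224, 224]))  # 196.0, matching "196.00MiB" in the log

# The same tensor at smaller batch sizes (size is linear in the batch).
for batch in (32, 16):
    print(batch, tensor_mib([batch, 16, 224, 224]))
```

This is one tensor only; training keeps many such tensors alive for the backward pass, so the total footprint is far larger than this single allocation.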
I hope this helps you and anyone else who runs into similar memory issues.