How do I resolve "ResourceExhaustedError: OOM when allocating tensor"?
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape [10000,32,28,28]
I have included almost all of the code:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

learning_rate = 0.0001
epochs = 10
batch_size = 50
# declare the training data placeholders
# input x - for 28 x 28 pixels = 784 - this is the flattened image data that is drawn from
# mnist.train.nextbatch()
x = tf.placeholder(tf.float32, [None, 784])
# dynamically reshape the input
x_shaped = tf.reshape(x, [-1, 28, 28, 1])
# now declare the output data placeholder - 10 digits
y = tf.placeholder(tf.float32, [None, 10])
def create_new_conv_layer(input_data, num_input_channels, num_filters, filter_shape, pool_shape, name):
    # setup the filter input shape for tf.nn.conv_2d
    conv_filt_shape = [filter_shape[0], filter_shape[1], num_input_channels,
                       num_filters]
    # initialise weights and bias for the filter
    weights = tf.Variable(tf.truncated_normal(conv_filt_shape, stddev=0.03),
                          name=name+'_W')
    bias = tf.Variable(tf.truncated_normal([num_filters]), name=name+'_b')
    # setup the convolutional layer operation
    out_layer = tf.nn.conv2d(input_data, weights, [1, 1, 1, 1], padding='SAME')
    # add the bias
    out_layer += bias
    # apply a ReLU non-linear activation
    out_layer = tf.nn.relu(out_layer)
    # now perform max pooling
    ksize = [1, 2, 2, 1]
    strides = [1, 2, 2, 1]
    out_layer = tf.nn.max_pool(out_layer, ksize=ksize, strides=strides,
                               padding='SAME')
    return out_layer
# create some convolutional layers
layer1 = create_new_conv_layer(x_shaped, 1, 32, [5, 5], [2, 2], name='layer1')
layer2 = create_new_conv_layer(layer1, 32, 64, [5, 5], [2, 2], name='layer2')
flattened = tf.reshape(layer2, [-1, 7 * 7 * 64])
# setup some weights and bias values for this layer, then activate with ReLU
wd1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1000], stddev=0.03), name='wd1')
bd1 = tf.Variable(tf.truncated_normal([1000], stddev=0.01), name='bd1')
dense_layer1 = tf.matmul(flattened, wd1) + bd1
dense_layer1 = tf.nn.relu(dense_layer1)
# another layer with softmax activations
wd2 = tf.Variable(tf.truncated_normal([1000, 10], stddev=0.03), name='wd2')
bd2 = tf.Variable(tf.truncated_normal([10], stddev=0.01), name='bd2')
dense_layer2 = tf.matmul(dense_layer1, wd2) + bd2
y_ = tf.nn.softmax(dense_layer2)
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=dense_layer2, labels=y))
# add an optimiser
optimiser = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cross_entropy)
# define an accuracy assessment operation
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# setup the initialisation operator
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    # initialise the variables
    sess.run(init_op)
    total_batch = int(len(mnist.train.labels) / batch_size)
    for epoch in range(epochs):
        avg_cost = 0
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size=batch_size)
            _, c = sess.run([optimiser, cross_entropy],
                            feed_dict={x: batch_x, y: batch_y})
            avg_cost += c / total_batch
        test_acc = sess.run(accuracy,
                            feed_dict={x: mnist.test.images, y: mnist.test.labels})
        print("Epoch:", (epoch + 1), "cost =", "{:.3f}".format(avg_cost),
              "test accuracy: {:.3f}".format(test_acc))
    print("\nTraining complete!")
    print(sess.run(accuracy,
                   feed_dict={x: mnist.test.images, y: mnist.test.labels}))
The lines referenced in the error are:
create_new_conv_layer - function
sess.run .. in the training loop
More of the error output that I copied from the debugger is listed below (there are more lines, but I believe these are the main ones and the rest are caused by them):
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape [10000,32,28,28] [[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, layer1_W/read)]]
On a second run it produced the errors below. I have both a CPU and a GPU, as can be seen in the output; I understand that some of the CPU-related warnings may be because my TensorFlow was not compiled to use those features. I am running Windows 10 with CUDA 8, cuDNN 6, Python 3.5 and TensorFlow 1.3.0.
2017-10-03 03:53:58.944371: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-03 03:53:58.945563: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-03 03:53:59.230761: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:955] Found device 0 with properties:
name: Quadro K620 major: 5 minor: 0 memoryClockRate(GHz) 1.124
pciBusID 0000:01:00.0 Total memory: 2.00GiB Free memory: 1.66GiB
2017-10-03 03:53:59.231109: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-10-03 03:53:59.231229: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0: Y
2017-10-03 03:53:59.231363: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K620, pci bus id: 0000:01:00.0)
2017-10-03 03:54:01.511141: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2017-10-03 03:54:01.511372: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:375] error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
2017-10-03 03:54:01.511862: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-10-03 03:54:01.512074: F C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\kernels\conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)
Answer (score: 2)
The process fails with out-of-memory (OOM) because you push the whole test set through the network in a single evaluation (see this question). It is easy to see that 10000 * 32 * 28 * 28 * 4 is almost 1 GiB, while your GPU has only 1.66 GiB free in total, most of which is already taken by the network itself.
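The arithmetic can be checked directly: the error names the activation tensor of the first convolutional layer, shape [10000, 32, 28, 28], and each float32 element occupies 4 bytes.

```python
# Activations of the first conv layer when all 10000 test images are fed at
# once: 10000 images x 32 filters x 28 x 28 spatial, 4 bytes per float32.
tensor_bytes = 10000 * 32 * 28 * 28 * 4
print(tensor_bytes)          # 1003520000
print(tensor_bytes / 2**30)  # ~0.93 GiB
```

And that is just one intermediate tensor; weights, gradients and the other layers' activations come on top of it.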
The solution is to feed the neural network in batches not only for training but also for testing; the resulting accuracy is then the average over all batches. Moreover, you don't need to do this after every epoch: are you really interested in the test results of all the intermediate networks?
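A minimal sketch of batched evaluation, with a size-weighted average so a smaller final batch is handled correctly. The `accuracy_fn` callable is a stand-in for `sess.run(accuracy, feed_dict={x: ..., y: ...})`; here it is replaced by a toy NumPy function so the averaging logic itself can be run standalone.

```python
import numpy as np

def evaluate_in_batches(accuracy_fn, images, labels, batch_size=500):
    """Average a per-batch accuracy over the whole test set.

    accuracy_fn(batch_x, batch_y) stands in for a sess.run(accuracy, ...)
    call; the size-weighted sum makes the result exact even when the last
    batch is smaller than batch_size."""
    total, n = 0.0, len(images)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        acc = accuracy_fn(images[start:end], labels[start:end])
        total += acc * (end - start)
    return total / n

# toy stand-in: "accuracy" = fraction of labels equal to 1
images = np.zeros((10000, 784), dtype=np.float32)
labels = np.array([1] * 7000 + [0] * 3000)
acc_fn = lambda bx, by: float(np.mean(by == 1))
print(evaluate_in_batches(acc_fn, images, labels, batch_size=500))  # 0.7
```

With the real graph you would pass `lambda bx, by: sess.run(accuracy, feed_dict={x: bx, y: by})`, so only one batch of activations lives on the GPU at a time.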
Your second error message is most likely a consequence of the earlier failure, since the cuDNN driver no longer seems to be working. I would suggest restarting your machine.