Question

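Suppose I have a very small dataset of only 50 images. I want to reuse the code from the Red Pill tutorial, but apply random transformations to the images in each training batch, such as random changes to brightness, contrast, and so on. I added just one function: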

def preprocessImages(x):
    retValue = numpy.empty_like(x)
    for i in range(50):
        image = x[i]
        image = tf.reshape(image, [28,28,1])
        image = tf.image.random_brightness(image, max_delta=63)
        #image = tf.image.random_contrast(image, lower=0.2, upper=1.8)
        # Subtract off the mean and divide by the variance of the pixels.
        float_image = tf.image.per_image_whitening(image)
        float_image_Mat = sess.run(float_image)
        retValue[i] = float_image_Mat.reshape((28*28))
    return retValue

And made a small change to the old code:

batch = mnist.train.next_batch(50)
for i in range(1000):
  #batch = mnist.train.next_batch(50)
  if i%100 == 0:
    train_accuracy = accuracy.eval(feed_dict={
        x:preprocessImages(batch[0]), y_: batch[1], keep_prob: 1.0})
    print("step %d, training accuracy %g"%(i, train_accuracy))
  train_step.run(feed_dict={x: preprocessImages(batch[0]), y_: batch[1], keep_prob: 0.5})

The first iteration succeeds, and then it crashes:

step 0, training accuracy 0.02
W tensorflow/core/common_runtime/executor.cc:1027] 0x117e76c0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_12_grad/Relu_12/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_16)]]
W tensorflow/core/common_runtime/executor.cc:1027] 0x117e76c0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_13_grad/Relu_13/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_17)]]
W tensorflow/core/common_runtime/executor.cc:1027] 0x117e76c0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_14_grad/Relu_14/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_18)]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/media/sf_Data/mnistConv.py", line 69, in <module>
    train_step.run(feed_dict={x: preprocessImages(batch[0]), y_: batch[1], keep_prob: 0.5})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1267, in run
    _run_using_default_session(self, feed_dict, self.graph, session)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2763, in _run_using_default_session
    session.run(operation, feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 345, in run
    results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 419, in _do_run
    e.code)
tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_12_grad/Relu_12/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_16)]]
Caused by op u'gradients_4/Relu_12_grad/Relu_12/CheckNumerics', defined at:
  File "<stdin>", line 1, in <module>
  File "/media/sf_Data/mnistConv.py", line 58, in <module>
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 165, in minimize
    gate_gradients=gate_gradients)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 205, in compute_gradients
    loss, var_list, gate_gradients=(gate_gradients == Optimizer.GATE_OP))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 414, in gradients
    in_grads = _AsList(grad_fn(op_wrapper, *out_grads))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 107, in _ReluGrad
    t = _VerifyTensor(op.inputs[0], op.name, "ReluGrad input is not finite.")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 100, in _VerifyTensor
    verify_input = array_ops.check_numerics(t, message=msg)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 48, in check_numerics
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 633, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1710, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 988, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'Relu_12', defined at:
  File "<stdin>", line 1, in <module>
  File "/media/sf_Data/mnistConv.py", line 34, in <module>
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 506, in relu
    return _op_def_lib.apply_op("Relu", features=features, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 633, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1710, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 988, in __init__
    self._traceback = _extract_stack()

This is exactly the same error I get with my own dataset of 50 training examples.

Answer 1

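First things first: use the combined tf.nn.softmax_cross_entropy_with_logits operator instead of computing y_conv and then the cross-entropy. This may not solve your problem, but it is more numerically stable than the naive version in the Red Pill example.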
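To illustrate why the combined operator is more stable, here is a minimal NumPy sketch. It is not TensorFlow's actual implementation; the function names `naive_cross_entropy` and `fused_cross_entropy` and the example logits are made up for illustration. The naive version takes the log of softmax probabilities that can underflow to zero, while the fused version works with log-probabilities directly.

```python
import numpy as np

def naive_cross_entropy(logits, labels):
    # Softmax first, then log: a probability that underflows to 0
    # becomes log(0) = -inf, and 0 * -inf is NaN.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    probs = exps / exps.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.sum(labels * np.log(probs), axis=1)

def fused_cross_entropy(logits, labels):
    # Compute log-softmax directly (log-sum-exp form), so no
    # probability is ever passed through log().
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.sum(labels * log_probs, axis=1)

# Hypothetical extreme logits for one example with three classes.
logits = np.array([[1000.0, 0.0, -1000.0]])
labels = np.array([[1.0, 0.0, 0.0]])

print("naive:", naive_cross_entropy(logits, labels))
print("fused:", fused_cross_entropy(logits, labels))
```

With these logits the naive version returns NaN and the fused version stays finite; for moderate logits the two agree. In the tutorial code, the fused operator would take the pre-softmax activations rather than y_conv, and its exact argument order depends on the TensorFlow version.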

Second, try printing out cross_entropy at every iteration:

cross_entropy = .... (previous code here)
cross_entropy = tf.Print(cross_entropy, [cross_entropy], "Cross-entropy: ")

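This shows whether it goes to infinity as the model progresses, or just jumps to inf or NaN. If it blows up gradually, the learning rate is the likely culprit. If it jumps, it is probably a numerical boundary condition that the change above could fix. If it is there from the very start, you may have a mistake in how you apply the distortions, so that you end up feeding in bad data.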
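To check the last case, the preprocessed batch can be inspected before it is fed to the network. The following is a minimal sketch; `describe_batch` is a hypothetical helper, not part of the tutorial or of TensorFlow.

```python
import numpy as np

def describe_batch(images):
    """Summarize a batch of images before it is fed to the network."""
    images = np.asarray(images, dtype=np.float64)
    finite = np.isfinite(images)
    has_finite = bool(finite.any())
    return {
        "all_finite": bool(finite.all()),      # False if any NaN or inf
        "num_bad": int((~finite).sum()),       # count of NaN/inf entries
        "min": float(images[finite].min()) if has_finite else None,
        "max": float(images[finite].max()) if has_finite else None,
    }

# Stand-in data: two flattened 28x28 images, one with a NaN pixel.
good = np.zeros((2, 28 * 28))
bad = good.copy()
bad[1, 10] = np.nan

print(describe_batch(good))
print(describe_batch(bad))
```

In the question's code, this could be called on the result of preprocessImages(batch[0]) before building the feed_dict. Besides NaN or inf entries, the min and max are worth a look: if the pixel values are on a small scale, a brightness max_delta of 63 is very large by comparison.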

TensorFlow convolutional neural network - training with a small dataset, applying random changes to the images
