TensorFlow returns NaN for a simple computation

Time: 2017-07-23 02:29:03

Tags: python tensorflow

I'm trying to learn TensorFlow by reading through their tutorial and making small modifications. I've run into a problem where a tiny change to the code turns the output into NaN.

Their original code is:

import numpy as np
import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
# training data
x_train = [1,2,3,4]
y_train = [0,-1,-2,-3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
for i in range(1000):
  sess.run(train, {x:x_train, y:y_train})

# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x:x_train, y:y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))

The output of this is:

>python linreg2.py
2017-07-22 22:19:41.409167: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:19:41.409311: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:19:41.412452: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:19:41.412556: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:19:41.412683: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:19:41.412826: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:19:41.412958: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:19:41.413086: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
W: [-0.9999969] b: [ 0.99999082] loss: 5.69997e-11

Note all the warnings I get on every run, since I installed via pip rather than compiling it myself. Still, it does produce the correct output: W = -1 and b = 1.

I modified the code to this, only adding entries to the x_train and y_train variables:

import numpy as np
import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
# training data
x_train = [1,2,3,4,5,6,7]
y_train = [0,-1,-2,-3,-4,-5,-6]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
for i in range(1000):
  sess.run(train, {x:x_train, y:y_train})

# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x:x_train, y:y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))

This is the output of the new code:

2017-07-22 22:23:13.129983: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:23:13.130125: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:23:13.130853: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:23:13.130986: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:23:13.131126: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:23:13.131234: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:23:13.132178: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-22 22:23:13.132874: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
W: [ nan] b: [ nan] loss: nan

I really don't understand why extending the training data makes this happen. Is there something I'm missing?

Also, I'm not at all sure how to debug things in TF, e.g. stepping through the loop and printing the values as they change on each iteration. Just printing the variables doesn't seem to work. I'd like to know, so I can debug this kind of thing myself in the future!

1 Answer:

Answer 0 (score: 3)

Welcome to the wonderful world of hyperparameter tuning. A first thing to try: instead of only printing some output at the very end, also print some output inside the for loop, which might then look like this:

for i in range(1000):
    curr_W, curr_b, curr_loss,_ = sess.run([W, b, loss, train], {x:x_train, y:y_train})
    print("Iteration %d W: %s b: %s loss: %s"%(i, curr_W, curr_b, curr_loss))

If you run that, the output looks like this:

Iteration 0 W: [-2.61199999] b: [-0.84599996] loss: 153.79
Iteration 1 W: [ 2.93535995] b: [ 0.31516004] loss: 554.292
Iteration 2 W: [-7.70013809] b: [-1.79276371] loss: 2020.55
Iteration 3 W: [ 12.6241951] b: [ 2.35030031] loss: 7387.32
Iteration 4 W: [-26.27972031] b: [-5.46829081] loss: 27029.6
Iteration 5 W: [ 48.12573624] b: [ 9.59391212] loss: 98918.8
Iteration 6 W: [-94.23892212] b: [-19.11964607] loss: 362027.0
Iteration 7 W: [ 178.09707642] b: [ 35.9108963] loss: 1.32498e+06
Iteration 8 W: [-342.92483521] b: [-69.27098846] loss: 4.84928e+06
Iteration 9 W: [ 653.81640625] b: [ 132.04486084] loss: 1.77479e+07
Iteration 10 W: [-1253.05480957] b: [-252.99859619] loss: 6.49554e+07
...
Iteration 60 W: [ -1.52910250e+17] b: [ -3.08788499e+16] loss: 9.6847e+35
Iteration 61 W: [  2.92530566e+17] b: [  5.90739251e+16] loss: 3.54451e+36
Iteration 62 W: [ -5.59636369e+17] b: [ -1.13013526e+17] loss: 1.29725e+37
Iteration 63 W: [  1.07063302e+18] b: [  2.16204754e+17] loss: 4.74782e+37
Iteration 64 W: [ -2.04821397e+18] b: [ -4.13618407e+17] loss: 1.73766e+38
Iteration 65 W: [  3.91841178e+18] b: [  7.91287870e+17] loss: inf
Iteration 66 W: [ -7.49626247e+18] b: [ -1.51380280e+18] loss: inf
Iteration 67 W: [  1.43410016e+19] b: [  2.89603611e+18] loss: inf
Iteration 68 W: [ -2.74355815e+19] b: [ -5.54036982e+18] loss: inf
Iteration 69 W: [  5.24866609e+19] b: [  1.05992074e+19] loss: inf
...
Iteration 126 W: [ -6.01072457e+35] b: [ -1.21381189e+35] loss: inf
Iteration 127 W: [  1.14990384e+36] b: [  2.32212753e+35] loss: inf
Iteration 128 W: [ -2.19986564e+36] b: [ -4.44243161e+35] loss: inf
Iteration 129 W: [ inf] b: [  8.49875587e+35] loss: inf
Iteration 130 W: [ nan] b: [-inf] loss: inf
Iteration 131 W: [ nan] b: [ nan] loss: nan
Iteration 132 W: [ nan] b: [ nan] loss: nan

At this point you should be able to see that the values of W and b are being updated very aggressively, and that instead of decreasing, your loss is actually increasing and heading towards infinity very quickly. This in turn means your learning rate is too high. If you divide the learning rate by 10 and set it to 0.001, the final result is:

W: [-0.97952145] b: [ 0.8985914] loss: 0.0144026

This shows that your model has not converged yet (also have a look at the earlier output; ideally you would plot the loss over the iterations). The next experiment, with the learning rate set to 0.005, gives:

W: [-0.99999958] b: [ 0.99999791] loss: 6.48015e-12
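
For reference, the only change to the question's script in these experiments is the optimizer's learning rate. The longer training set needs a smaller step because the loss is a sum, not a mean, over the training points, so its gradients grow with the number of examples:

# The only line that changes relative to the question's script:
# a smaller step size for the sum-of-squares loss over 7 points
# (with the original 4 points, 0.01 was still small enough).
optimizer = tf.train.GradientDescentOptimizer(0.005)
train = optimizer.minimize(loss)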

So, to conclude:

  • Try extracting intermediate results from sess.run() (or eval() on some tensors) to watch how the model is learning (see the sketch after this list).
  • Hyperparameter tuning, for fun and profit.
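
As a minimal sketch of the first point (reusing sess, train, loss, x and y from the script above; matplotlib is an extra assumption here, used only to visualise the values), the training loop can collect the loss at every step so it can be inspected or plotted afterwards:

import matplotlib.pyplot as plt

losses = []
for i in range(1000):
    # run the training op and fetch the current loss in the same call
    _, curr_loss = sess.run([train, loss], {x: x_train, y: y_train})
    losses.append(curr_loss)

plt.plot(losses)  # a curve that shoots upwards means the step size is too large
plt.xlabel("iteration")
plt.ylabel("loss")
plt.show()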

Note: at this point you are still using "plain" gradient descent with a fixed learning rate, but there are also optimizers that adjust the learning rate automatically. The choice of optimizer (and its parameters) is yet another hyperparameter.
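
For example, as a sketch of one such option (not part of the original answer's code): swapping the optimizer line in the script above for tf.train.AdamOptimizer, which adapts its per-parameter step sizes and is therefore usually much less sensitive to the exact learning-rate value than plain gradient descent:

# Adaptive optimizer instead of plain gradient descent; Adam rescales its
# updates per parameter, so the initial learning rate is less critical.
optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
train = optimizer.minimize(loss)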