Inconsistent TensorFlow results across batch sizes

Asked: 2018-10-22 03:03:34

Tags: python tensorflow gpu

I am doing batched MLP evaluation. Because GPU memory is limited, I have to process a very large matrix (e.g. 1000000x43) in small batches (e.g. 1000x43). I found that the computed results are inconsistent across batch sizes. Here is a minimal example:
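
To make the setup concrete, here is a minimal sketch of the batching scheme described above (the function name, shapes, and batch size are my own illustration, not from the original code): evaluate a large matrix in consecutive row slices and stack the results.

```python
import numpy as np

def eval_in_batches(f, X, batch_size):
    """Apply f to consecutive row slices of X and stack the results."""
    parts = [f(X[i:i + batch_size]) for i in range(0, len(X), batch_size)]
    return np.concatenate(parts, axis=0)

# Stand-in for a large input that would not fit on the GPU at once:
X = np.random.rand(10000, 43)
out = eval_in_batches(lambda m: m @ np.ones((43, 1)), X, batch_size=1000)
print(out.shape)  # (10000, 1)
```

Mathematically, the result should not depend on `batch_size` at all; the question below is about a case where it apparently does.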

import numpy as np
import tensorflow as tf  # tensorflow-gpu 1.11.0

with tf.Session():
    x = tf.Variable(np.random.rand(43, 60000).astype('f2'))  # float16 input
    w = tf.Variable(np.ones((58624, 43), 'f2'))              # large batch
    b = tf.Variable(np.ones((58624, 1), 'f2'))
    v1 = w @ x + b
    w = tf.Variable(np.ones((5376, 43), 'f2'))               # small batch
    b = tf.Variable(np.ones((5376, 1), 'f2'))
    v2 = w @ x + b
    tf.global_variables_initializer().run()
    print(v1.eval(), v2.eval())

Output:

[[ 21.375     18.578125  22.765625 ...,  23.828125  24.203125  22.0625  ]
 [ 21.375     18.578125  22.765625 ...,  23.828125  24.203125  22.0625  ]
 [ 21.375     18.578125  22.765625 ...,  23.828125  24.203125  22.0625  ]
 ..., 
 [ 21.375     18.578125  22.765625 ...,  23.828125  24.203125  22.0625  ]
 [ 21.375     18.578125  22.765625 ...,  23.828125  24.203125  22.0625  ]
 [ 21.375     18.578125  22.765625 ...,  23.828125  24.203125  22.0625  ]] [[ 22.375     19.578125  23.765625 ...,  24.828125  25.203125  23.0625  ]
 [ 22.375     19.578125  23.765625 ...,  24.828125  25.203125  23.0625  ]
 [ 22.375     19.578125  23.765625 ...,  24.828125  25.203125  23.0625  ]
 ..., 
 [ 22.375     19.578125  23.765625 ...,  24.828125  25.203125  23.0625  ]
 [ 22.375     19.578125  23.765625 ...,  24.828125  25.203125  23.0625  ]
 [ 22.375     19.578125  23.765625 ...,  24.828125  25.203125  23.0625  ]]

Clearly, the results for the two batch sizes (58624 and 5376) differ by 1. This should not happen: w and b differ only along their first axis, which should simply broadcast, so every output row is computed from identical data.
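
For reference, a quick NumPy check (my own, not from the original question; column count shrunk to 100 so it fits in memory) confirms that in float64 the batch size cannot affect a given row: every row of w @ x + b is computed from the same x and identical rows of w and b.

```python
import numpy as np

x = np.random.rand(43, 100)              # shared input
big   = np.ones((58624, 43)) @ x + 1.0   # "batch size" 58624
small = np.ones((5376, 43)) @ x + 1.0    # "batch size" 5376

# Stacking more identical rows cannot change what any one row computes.
print(np.allclose(big[0], small[0]))     # True
```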

I am using tensorflow-gpu 1.11.0 from conda; the platform is Ubuntu Server 18.04.01 on amd64, and the graphics card is an Nvidia Titan Xp. Is this a genuine bug, or is my reasoning wrong somewhere?

By the way:

  • In this example the result seems independent of the actual batch-size value, but in my real (larger) example, which uses regular and batched matrix multiplication (both @), a broadcast tanh, tensor addition (+), and tf.linalg.norm computing matrix norms over the last two axes in batches, the result appears to scale with the batch size.

  • If I remove the +b part from v1 and v2, the results become identical. However, I cannot reproduce the problem if I replace the matrix multiplication with a random tensor.
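
One thing worth noting (my own observation, not a confirmed diagnosis of the question): 'f2' is float16, which is very coarse at these magnitudes, so accumulation precision and order alone can shift a sum by a full representable step.

```python
import numpy as np

# The gap between adjacent float16 values near 22 is 2**-6 = 0.015625:
print(np.spacing(np.float16(22.0)))

# Summing the same 43 terms with a float16 accumulator vs a float32 one:
v = np.random.rand(43).astype(np.float16)
acc16 = np.float16(0.0)
for t in v:                               # sequential float16 accumulation
    acc16 = np.float16(acc16 + t)
acc32 = np.float16(v.astype(np.float32).sum())  # float32 accumulator
print(acc16, acc32)                       # may differ by a few float16 steps
```

This does not by itself explain an exact difference of 1, but it shows why float16 matmul results can legitimately vary with how the reduction is carried out.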

The larger example is here:

import numpy as np
import tensorflow as tf

with tf.Session():
    x = tf.Variable(np.ones((43, 60000), 'f2'))
    y = tf.Variable(np.ones((8, 60000), 'f2'))
    w = tf.Variable(np.ones((916 * 64, 43), 'f2'))
    b = tf.Variable(np.ones((916 * 64, 1), 'f2'))
    v = tf.Variable(np.ones((916, 8, 64), 'f2'))
    c = tf.Variable(np.ones((916, 8, 1), 'f2'))
    # cast to float32 before the norm to prevent overflow
    r = tf.linalg.norm(tf.cast(v @ tf.reshape(tf.tanh(w @ x + b), (916, 64, 60000)) + c, 'float32'), axis=(1, 2))
    w = tf.Variable(np.ones((84 * 64, 43), 'f2'))
    b = tf.Variable(np.ones((84 * 64, 1), 'f2'))
    v = tf.Variable(np.ones((84, 8, 64), 'f2'))
    c = tf.Variable(np.ones((84, 8, 1), 'f2'))
    R = tf.linalg.norm(tf.cast(v @ tf.reshape(tf.tanh(w @ x + b), (84, 64, 60000)) + c, 'float32'), axis=(1, 2))
    tf.global_variables_initializer().run()
    print(r.eval()[::100], R.eval()[::10])  # print a summary of each result

This produces:

[ 1906641.625  1906641.625  1906641.625  1906641.625  1906641.625
  1906641.625  1906641.625  1906641.625  1906641.625  1906641.625]
[ 45033.32421875  45033.32421875  45033.32421875  45033.32421875
  45033.32421875  45033.32421875  45033.32421875  45033.32421875
  45033.32421875]
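
Since the larger example uses all-ones inputs, the expected value can be computed by hand (my own check, not part of the original question): every entry of w @ x is 43, adding b gives 44, tanh(44) saturates to 1, each entry of the batched matmul is then 64, adding c gives 65, and the norm over an (8, 60000) slice of constant 65 is 65 * sqrt(8 * 60000).

```python
import math

# Frobenius norm of an (8, 60000) block whose entries are all 65:
expected = 65 * math.sqrt(8 * 60000)
print(expected)  # ~45033.32, matching R (the small batch), not r
```

So the small-batch result R agrees with the exact value, while the large-batch result r does not.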

Thanks!

0 Answers:

No answers yet