I am running into a problem where my TensorFlow execution hangs at compute_gradients. I initialize my model and then set up the loss function as shown below. Note that I have not started training yet, so the problem is not with my data.
# The model for training
given_model = GivenModel(images_input=images_t)

print("Done setting up the model")

with tf.device('/gpu:0'):
    with tf.variable_scope('prediction_loss'):
        logits = given_model.prediction
        softmax_loss_per_sample = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels))
        total_training_loss = softmax_loss_per_sample

        optimizer = tf.train.AdamOptimizer()
        gradients, variables = zip(*optimizer.compute_gradients(total_training_loss))
        gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
        optimize = optimizer.apply_gradients(zip(gradients, variables))

        with tf.control_dependencies([optimize]):
            train_op = tf.constant(0)
This code just hangs and does nothing. When I Ctrl+C out of it (no matter how long it has been running), it is always stuck at compute_gradients.

Does anyone know why this happens? I am not doing this in a loop, and my model is not very large. It also seems to be doing this on the CPU (no memory has been allocated on the GPU yet) despite the with tf.device('/gpu:0'): option, and I cannot force it to use the GPU.

Thanks

Here is what gets printed when I Ctrl+C:
    gradients, variables = zip(*optimizer.compute_gradients(total_training_loss))
  File ".local/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 35$, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File ".local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 48$, in gradients
    in_grads = grad_fn(op, *out_grads)
  File ".local/lib/python2.7/site-packages/tensorflow/python/ops/nn_grad.py", line 269, in _ReluGrad
    return gen_nn_ops._relu_grad(grad, op.outputs[0])
  File ".local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 2212, in _relu_grad
    features=features, name=name)
  File ".local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File ".local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File ".local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1268, in __init__
    self._control_flow_context.AddOp(self)
  File ".local/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line $039, in AddOp
    self._AddOpInternal(op)
  File ".local/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line $062, in _AddOpInternal
    real_x = self.AddValue(x)
  File ".local/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line $998, in AddValue
    real_val = grad_ctxt.grad_state.GetRealValue(val)
  File ".local/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line $001, in GetRealValue
    history_value = cur_grad_state.AddForwardAccumulator(cur_value)
  File ".local/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 892, in AddForwardAccumulator
    self.forward_index.op._add_control_input(push.op)
  File ".local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1434, in _add_control_input
    self._add_control_inputs([op])
  File ".local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1422, in _add_control_inputs
    self._recompute_node_def()
  File ".local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1442, in _recompute_node_def
    self._control_inputs])
  File ".local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1317, in name
    return self._node_def.name
KeyboardInterrupt
Answer 0 (score: 1)
If you have not started training at this point, the issue is probably in the graph construction. Are you sure GivenModel is correct? I modified this autoencoder example to use your optimizer definition, and I see no problem when executing this code:
from __future__ import division, print_function, absolute_import

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# Training Parameters
learning_rate = 0.01
num_steps = 10
batch_size = 8

# Network Parameters
num_hidden_1 = 256  # 1st layer num features
num_hidden_2 = 128  # 2nd layer num features (the latent dim)
num_input = 784     # MNIST data input (img shape: 28*28)

# tf Graph input (only pictures)
X = tf.placeholder("float", [None, num_input])

weights = {
    'encoder_h1': tf.Variable(tf.random_normal([num_input, num_hidden_1])),
    'encoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_hidden_2])),
    'decoder_h1': tf.Variable(tf.random_normal([num_hidden_2, num_hidden_1])),
    'decoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_input])),
}
biases = {
    'encoder_b1': tf.Variable(tf.random_normal([num_hidden_1])),
    'encoder_b2': tf.Variable(tf.random_normal([num_hidden_2])),
    'decoder_b1': tf.Variable(tf.random_normal([num_hidden_1])),
    'decoder_b2': tf.Variable(tf.random_normal([num_input])),
}

# Building the encoder
def encoder(x):
    # Encoder Hidden layer with sigmoid activation #1
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']),
                                   biases['encoder_b1']))
    # Encoder Hidden layer with sigmoid activation #2
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']),
                                   biases['encoder_b2']))
    return layer_2

# Building the decoder
def decoder(x):
    # Decoder Hidden layer with sigmoid activation #1
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']),
                                   biases['decoder_b1']))
    # Decoder Hidden layer with sigmoid activation #2
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']),
                                   biases['decoder_b2']))
    return layer_2

# Construct model
encoder_op = encoder(X)
decoder_op = decoder(encoder_op)

# Prediction
y_pred = decoder_op
# Targets (Labels) are the input data.
y_true = X

# Define loss and optimizer, minimize the squared error
### your code with a reconstruction loss
with tf.device('/gpu:0'):
    with tf.variable_scope('prediction_loss'):
        loss = tf.reduce_mean(tf.pow(y_true - y_pred, 2))
        optimizer = tf.train.AdamOptimizer()
        gradients, variables = zip(*optimizer.compute_gradients(loss))
        gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
        optimize = optimizer.apply_gradients(zip(gradients, variables))
        with tf.control_dependencies([optimize]):
            train_op = tf.constant(0)
### end of your code

# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

# Start Training
# Start a new TF session
with tf.Session() as sess:
    # Run the initializer
    sess.run(init)

    # Training
    for i in range(1, num_steps + 1):
        # Prepare Data
        # Get the next batch of MNIST data (only images are needed, not labels)
        batch_x, _ = mnist.train.next_batch(batch_size)

        # Run optimization op (backprop) and cost op (to get loss value)
        _, l = sess.run([train_op, loss], feed_dict={X: batch_x})
        # Display logs per step
        print('Step %i: Minibatch Loss: %f' % (i, l))
So I suspect the problem is in the rest of your model, but to be sure we would need more details about it.
Now, regarding whether the model is placed on the CPU or the GPU: if you do not pin anything to the CPU, a GPU device is chosen for you automatically, so in theory the model should be assigned to the GPU. But again, maybe the graph construction is the problem and it never reaches the point of actually allocating the model in GPU memory. One way to check the placement is sketched below.
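A minimal sketch of how you could verify placement, assuming TF 1.x as in the question: log_device_placement prints the device chosen for every op once the session starts, and allow_soft_placement falls back to the CPU instead of failing when an op has no GPU kernel.

import tensorflow as tf

# Minimal placement check (assumes TF 1.x, as in the question).
config = tf.ConfigProto(
    log_device_placement=True,   # print the device chosen for every op
    allow_soft_placement=True)   # fall back to CPU when no GPU kernel exists

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())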
Answer 1 (score: 0)
For me, the problem was that my model was too large. Reducing its size fixed the issue.
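If you suspect the same cause, one rough way to check whether the graph has grown unexpectedly large before calling compute_gradients is to count its ops (a sketch, assuming TF 1.x and the default graph):

import tensorflow as tf

# Count the ops registered in the default graph so far; a very large number
# suggests the graph itself (not the data) is what compute_gradients is
# grinding through.
print("ops in graph:", len(tf.get_default_graph().get_operations()))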
Answer 2 (score: 0)
I have hit this problem for three reasons:

1. The model was large, so reduce the batch size.
2. There was a variable with no gradient:
clone_grads = optimizer.compute_gradients(total_clone_loss)
for grad_and_vars in zip(*clone_grads):
    tf.logging.info("clone_grads " + str(grad_and_vars))
It prints:

INFO:tensorflow:clone_grads((,),)
INFO:tensorflow:clone_grads((None,),)
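A sketch of the usual fix for the second case (my own addition, not part of the original answer): filter out the (gradient, variable) pairs whose gradient is None before clipping, since tf.clip_by_global_norm does not accept None entries. Here total_clone_loss and optimizer are assumed from the snippet above, and the clip norm of 5.0 is an arbitrary example value.

grads_and_vars = optimizer.compute_gradients(total_clone_loss)
# Variables that do not influence the loss come back with a None gradient;
# drop them so clipping and apply_gradients only see real tensors.
grads_and_vars = [(g, v) for g, v in grads_and_vars if g is not None]
gradients, variables = zip(*grads_and_vars)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
train_op = optimizer.apply_gradients(zip(gradients, variables))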