Summary: I have a training routine that tries to reload a saved graph to continue training, but when I try to load the optimizer with optimizer = tf.get_collection("optimizer")[0] I get IndexError: list index out of range. I ran into several other errors along the way, but this is the one that finally got me stuck. I eventually figured it out, so I'm answering my own question in case it helps someone else.
The goal is simple: I spent more than six hours training a model before saving it, and now I want to reload it and train it further. However, no matter what I do, I get an error.
I found a very simple example on GitHub that just creates a saver = tf.train.Saver() op, then saves with saver.save(sess, model_path) and restores with saver.restore(sess, model_path). When I try that, I get At least two variables have the same name: decode/decoder/dense/kernel/Adam_1. I'm using the Adam optimizer, so I'm guessing that's related to the problem. I worked around that problem with the approach below.
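For reference, the pattern from that example is roughly the following. This is only a minimal sketch of the save/restore idiom, not the example's exact code; build_model and model_path are stand-ins for my own graph-building code and checkpoint path.

import tensorflow as tf

build_model()                      # hypothetical: defines the graph in the default graph
saver = tf.train.Saver()           # must be created after the variables exist

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... train ...
    saver.save(sess, model_path)   # write the variables to the checkpoint

with tf.Session() as sess:
    saver.restore(sess, model_path)  # reload the variables into the same graph
    # ... continue training ...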
I know the model is good because further down in my code (see the bottom) I have a prediction routine that loads the saved model, runs an input through it, and works. It uses loaded_graph = tf.Graph(), then loader = tf.train.import_meta_graph(checkpoint + '.meta') plus loader.restore(sess, checkpoint) to load the model, and then makes a series of loaded_graph.get_tensor_by_name('input:0') calls.
When I try that approach (you can see the commented-out code), the "two variables" problem goes away, but now I get TypeError: Cannot interpret feed_dict key as Tensor: The name 'save/Const:0' refers to a Tensor which does not exist. The operation, 'save/Const', does not exist in the graph.
This post does a good job of explaining how to organize the code to avoid ValueError: cannot add op with name <my weights variable name>/Adam as that name is already used, and I have done that.
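As I understand that post, the point is to build the graph, Adam ops included, exactly once inside an explicit tf.Graph(), rather than re-running graph construction on top of the default graph (which adds a second set of .../Adam ops). A rough sketch of that layout, in my own paraphrase and not the post's exact code:

# Build the whole graph, optimizer included, exactly once inside its own Graph
train_graph = tf.Graph()
with train_graph.as_default():
    # ... placeholders, model, cost ... (elided)
    optimizer = tf.train.AdamOptimizer(lr)
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(g, -5., 5.), v) for g, v in gradients if g is not None]
    train_op = optimizer.apply_gradients(capped_gradients)
    saver = tf.train.Saver()   # create the Saver inside the same graph

with tf.Session(graph=train_graph) as sess:
    # ... train, save, restore against this one graph ...
    pass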
@mmry explains the TypeError here, but I don't understand what he is saying and don't know how to fix it.
I have spent the whole day moving things around and getting different errors, and I am out of ideas. Any help would be appreciated.
import time

# Split data to training and validation sets
train_source = source_letter_ids[batch_size:]
train_target = target_letter_ids[batch_size:]
valid_source = source_letter_ids[:batch_size]
valid_target = target_letter_ids[:batch_size]
(valid_targets_batch, valid_sources_batch, valid_targets_lengths, valid_sources_lengths) = next(get_batches(valid_target, valid_source, batch_size,
                                                                                                             source_letter_to_int['<PAD>'],
                                                                                                             target_letter_to_int['<PAD>']))

if (len(source_sentences) > 10000):
    display_step = 100  # Check training loss after each of this many batches with large data
else:
    display_step = 20   # Check training loss after each of this many batches with small data

# loader = tf.train.import_meta_graph(checkpoint + '.meta')
# loaded_graph = tf.get_default_graph()
# input_data = loaded_graph.get_tensor_by_name('input:0')
# targets = loaded_graph.get_tensor_by_name('targets:0')
# lr = loaded_graph.get_tensor_by_name('learning_rate:0')
# source_sequence_length = loaded_graph.get_tensor_by_name('source_sequence_length:0')
# target_sequence_length = loaded_graph.get_tensor_by_name('target_sequence_length:0')
# keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
# loader = tf.train.Saver()

saver = tf.train.Saver()

with tf.Session(graph=train_graph) as sess:
    start = time.time()
    sess.run(tf.global_variables_initializer())
    # loader.restore(sess, checkpoint)
    # optimizer = tf.get_collection("optimization")[0]
    # gradients = optimizer.compute_gradients(cost)
    # capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
    # train_op = optimizer.apply_gradients(capped_gradients)

    for epoch_i in range(1, epochs+1):
        for batch_i, (targets_batch, sources_batch, targets_lengths, sources_lengths) in enumerate(
                get_batches(train_target, train_source, batch_size,
                            source_letter_to_int['<PAD>'],
                            target_letter_to_int['<PAD>'])):

            # Training step
            _, loss = sess.run(
                [train_op, cost],
                {input_data: sources_batch,
                 targets: targets_batch,
                 lr: learning_rate,
                 target_sequence_length: targets_lengths,
                 source_sequence_length: sources_lengths,
                 keep_prob: keep_probability})

            # Debug message updating us on the status of the training
            if batch_i % display_step == 0 and batch_i > 0:
                # Calculate validation cost
                validation_loss = sess.run(
                    [cost],
                    {input_data: valid_sources_batch,
                     targets: valid_targets_batch,
                     lr: learning_rate,
                     target_sequence_length: valid_targets_lengths,
                     source_sequence_length: valid_sources_lengths,
                     keep_prob: 1.0})

                print('Epoch {:>3}/{} Batch {:>6}/{} Inputs (000) {:>7} - Loss: {:>6.3f} - Validation loss: {:>6.3f}'
                      .format(epoch_i, epochs, batch_i, len(train_source) // batch_size,
                              (((epoch_i - 1) * len(train_source)) + batch_i * batch_size) // 1000,
                              loss, validation_loss[0]))

    # Save model
    saver = tf.train.Saver()
    saver.save(sess, checkpoint)

    # Print time spent training the model
    end = time.time()
    seconds = end - start
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    print('Model Trained in {}h:{}m:{}s and Saved'.format(int(h), int(m), int(s)))
This code works, so I know the graph is being saved successfully.
loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    source_sequence_length = loaded_graph.get_tensor_by_name('source_sequence_length:0')
    target_sequence_length = loaded_graph.get_tensor_by_name('target_sequence_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')

    # Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size,
                                      target_sequence_length: [len(text)]*batch_size,
                                      source_sequence_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0]
Here is the other problem, in the training code, where I tried to follow @jie-zhou's suggestion. The line optimizer = tf.get_collection("optimization")[0] gives me IndexError: list index out of range. That line is only valid after sess.run(tf.global_variables_initializer()), so I don't see what I'm supposed to be initializing.
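For debugging, one way to see which collections actually exist in the imported graph is something like the following. This is just an inspection sketch I'm including for context, not part of my training code.

# Debugging sketch: list the collection keys present in the imported graph
# to see whether "optimization" (or anything else) was ever added to it
loader = tf.train.import_meta_graph(checkpoint + '.meta')
graph = tf.get_default_graph()
print(graph.get_all_collection_keys())        # e.g. ['variables', 'trainable_variables', ...]
print(graph.get_collection("optimization"))   # an empty list here means [0] raises IndexError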
import time

# Split data to training and validation sets
train_source = source_letter_ids[batch_size:]
train_target = target_letter_ids[batch_size:]
valid_source = source_letter_ids[:batch_size]
valid_target = target_letter_ids[:batch_size]
(valid_targets_batch, valid_sources_batch, valid_targets_lengths, valid_sources_lengths) = next(get_batches(valid_target, valid_source, batch_size,
                                                                                                             source_letter_to_int['<PAD>'],
                                                                                                             target_letter_to_int['<PAD>']))

if (len(source_sentences) > 10000):
    display_step = 100  # Check training loss after each of this many batches with large data
else:
    display_step = 20   # Check training loss after each of this many batches with small data

loader = tf.train.import_meta_graph(checkpoint + '.meta')
loaded_graph = tf.get_default_graph()
input_data = loaded_graph.get_tensor_by_name('input:0')
targets = loaded_graph.get_tensor_by_name('targets:0')
lr = loaded_graph.get_tensor_by_name('learning_rate:0')
source_sequence_length = loaded_graph.get_tensor_by_name('source_sequence_length:0')
target_sequence_length = loaded_graph.get_tensor_by_name('target_sequence_length:0')
keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')

with tf.Session(graph=train_graph) as sess:
    start = time.time()
    sess.run(tf.group(tf.global_variables_initializer(), tf.local_variables_initializer()))
    loader.restore(sess, checkpoint)
    optimizer = tf.get_collection("optimization")[0]
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)

    for epoch_i in range(1, epochs+1):
        for batch_i, (targets_batch, sources_batch, targets_lengths, sources_lengths) in enumerate(
                get_batches(train_target, train_source, batch_size,
                            source_letter_to_int['<PAD>'],
                            target_letter_to_int['<PAD>'])):

            # Training step
            _, loss = sess.run(
                [train_op, cost],
                {input_data: sources_batch,
                 targets: targets_batch,
                 lr: learning_rate,
                 target_sequence_length: targets_lengths,
                 source_sequence_length: sources_lengths,
                 keep_prob: keep_probability})

            # Debug message updating us on the status of the training
            if batch_i % display_step == 0 and batch_i > 0:
                # Calculate validation cost
                validation_loss = sess.run(
                    [cost],
                    {input_data: valid_sources_batch,
                     targets: valid_targets_batch,
                     lr: learning_rate,
                     target_sequence_length: valid_targets_lengths,
                     source_sequence_length: valid_sources_lengths,
                     keep_prob: 1.0})

                print('Epoch {:>3}/{} Batch {:>6}/{} Inputs (000) {:>7} - Loss: {:>6.3f} - Validation loss: {:>6.3f}'
                      .format(epoch_i, epochs, batch_i, len(train_source) // batch_size,
                              (((epoch_i - 1) * len(train_source)) + batch_i * batch_size) // 1000,
                              loss, validation_loss[0]))

    # Save model
    saver = tf.train.Saver()
    saver.save(sess, checkpoint)

    # Print time spent training the model
    end = time.time()
    seconds = end - start
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    print('Model Trained in {}h:{}m:{}s and Saved'.format(int(h), int(m), int(s)))
Trying to follow this model more closely, I added code to check whether a graph already exists and to do something different when I'm loading an existing one. I also built it to resemble the prediction code, which I know works. One important difference is that, unlike during prediction, I need to load the optimizer for training.
Running with a brand-new graph works fine, but loading an existing graph still fails: I still get IndexError: list index out of range at optimizer = tf.get_collection("optimization")[0].
I've trimmed some of the code above to focus on the essentials.
# Test to see if graph already exists
if os.path.exists(checkpoint + ".meta"):
    print("Reloading existing graph to continue training.")
    brand_new = False
    train_graph = tf.Graph()
    # saver = tf.train.import_meta_graph(checkpoint + '.meta')
    # train_graph = tf.get_default_graph()
else:
    print("Starting with new graph.")
    brand_new = True

with train_graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=train_graph) as sess:
    start = time.time()
    if brand_new:
        sess.run(tf.global_variables_initializer())
    else:
        # sess.run(tf.group(tf.global_variables_initializer(), tf.local_variables_initializer()))
        saver = tf.train.import_meta_graph(checkpoint + '.meta')
        saver.restore(sess, checkpoint)

        # Restore variables
        input_data = train_graph.get_tensor_by_name('input:0')
        targets = train_graph.get_tensor_by_name('targets:0')
        lr = train_graph.get_tensor_by_name('learning_rate:0')
        source_sequence_length = train_graph.get_tensor_by_name('source_sequence_length:0')
        target_sequence_length = train_graph.get_tensor_by_name('target_sequence_length:0')
        keep_prob = train_graph.get_tensor_by_name('keep_prob:0')

        # Load the optimizer
        # Commenting out this block gives 'ValueError: Operation name: "optimization/Adam"'
        # Leaving it gives 'IndexError: list index out of range' at 'optimizer = tf.get_collection("optimizer")[0]'
        optimizer = tf.get_collection("optimizer")[0]
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)

    for epoch_i in range(1, epochs+1):
        for batch_i, (targets_batch, sources_batch, targets_lengths, sources_lengths) in enumerate(
                get_batches(train_target, train_source, batch_size,
                            source_letter_to_int['<PAD>'],
                            target_letter_to_int['<PAD>'])):

            # Training step
            _, loss = sess.run(...)

            # Debug message updating us on the status of the training
            if batch_i % display_step == 0 and batch_i > 0:
                # Calculate validation cost and output update to training
                ...

    # Save model
    # saver = tf.train.Saver()
    saver.save(sess, checkpoint)
Answer 0 (score: 0)
optimizer = tf.get_collection("optimization")[0]
在尝试恢复已保存的图表时抛出了IndexError: list index out of range
,原因很简单,因为它没有被命名为&#34;当图形被构建时,图中没有任何内容被称为&#34;优化器&#34;。
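In other words, tf.get_collection only returns things that were explicitly stored under that key with tf.add_to_collection when the graph was built, and collections are exported with the meta graph. A sketch of what the build-time code would have needed (this is not what I ended up doing, just the counterpart of the failing lookup):

# At graph-build time: store the op under the key the training code will look up
tf.add_to_collection("optimization", train_op)

# Later, after import_meta_graph / restore, the list is non-empty:
train_op = tf.get_collection("optimization")[0]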
The training step _, loss = sess.run([train_op, cost], {input_data: sources_batch, targets: targets_batch, lr: learning_rate, target_sequence_length: targets_lengths, source_sequence_length: sources_lengths, keep_prob: keep_probability}) needs input_data, targets, lr, target_sequence_length, source_sequence_length, and keep_prob. As you can see, all of these are restored by this code:
# Restore variables
input_data = train_graph.get_tensor_by_name('input:0')
targets = train_graph.get_tensor_by_name('targets:0')
lr = train_graph.get_tensor_by_name('learning_rate:0')
source_sequence_length = train_graph.get_tensor_by_name('source_sequence_length:0')
target_sequence_length = train_graph.get_tensor_by_name('target_sequence_length:0')
keep_prob = train_graph.get_tensor_by_name('keep_prob:0')
This works because, when building the graph, I gave each of these variables a name, e.g. input_data = tf.placeholder(tf.int32, [None, None], name='input').
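For completeness, the build-time naming looks roughly like this. The names match the get_tensor_by_name() calls above; the dtypes and shapes other than input_data are my guesses at the pattern, not copied from my build code.

# Build-time sketch: every placeholder the training step feeds gets an explicit name
input_data = tf.placeholder(tf.int32, [None, None], name='input')
targets = tf.placeholder(tf.int32, [None, None], name='targets')
lr = tf.placeholder(tf.float32, name='learning_rate')
source_sequence_length = tf.placeholder(tf.int32, (None,), name='source_sequence_length')
target_sequence_length = tf.placeholder(tf.int32, (None,), name='target_sequence_length')
keep_prob = tf.placeholder(tf.float32, name='keep_prob')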
In addition, the training step needs train_op and cost. (Notably, it does not need optimizer directly. I only noticed that after my naive attempts to regenerate train_op did not work.)
The final solution turned out to be very simple. In the code that builds the graph, immediately after creating train_op and cost, I run tf.add_to_collection("train_op", train_op) and tf.add_to_collection("cost", cost). These statements "name" the ops in the graph so I can fetch them later. Then, in the training routine, after restoring the variables above, I run this:
# Grab the optimizer variables that were added to the collection during build
cost = tf.get_collection("cost")[0]
train_op = tf.get_collection("train_op")[0]
Both of these now work: the saved graph is loaded, all of the necessary variables and ops are found, and training picks up where it left off.
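For completeness, the build-time side of that fix is just the two add_to_collection calls described above, placed right after cost and train_op are created (sketch; the surrounding graph-building code is omitted):

# In the graph-building code, immediately after cost and train_op are created:
tf.add_to_collection("cost", cost)          # store the cost tensor under a known key
tf.add_to_collection("train_op", train_op)  # store the training op under a known key
# Both are exported with the meta graph, so tf.get_collection() finds them after restore.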