When I try to evaluate my model with slim.evaluation.evaluate_once(), I get a NotFoundError. It says that a key (a variable) cannot be found in the checkpoint. Like this:
Running evaluation Loop...
INFO:tensorflow:Starting evaluation at 2017-08-25-11:40:57
INFO:tensorflow:Starting evaluation at 2017-08-25-11:40:57
INFO:tensorflow:Restoring parameters from tmp/flowers/finetune_log/model.ckpt-5000
INFO:tensorflow:Restoring parameters from tmp/flowers/finetune_log/model.ckpt-5000
NotFoundError Traceback (most recent call last)
/home/wangx/Dev_env/.tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1326 try:
-> 1327 return fn(*args)
1328 except errors.OpError as e:
/home/wangx/Dev_env/.tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
1305 feed_dict, fetch_list, target_list,
-> 1306 status, run_metadata)
1307
/usr/lib/python3.5/contextlib.py in __exit__(self, type, value, traceback)
65 try:
---> 66 next(self.gen)
67 except StopIteration:
...
NotFoundError (see above for traceback): Key InceptionV1/Mixed_4c/Branch_0/Conv2d_0a_1x1/biases not found in checkpoint
[[Node: save/RestoreV2_44 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_44/tensor_names, save/RestoreV2_44/shape_and_slices)]]
[[Node: save/RestoreV2_6/_1 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_238_save/RestoreV2_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
I saved the checkpoints in ./tmp/flowers/finetune_log and downloaded the flowers photos as described in the tutorial. Is something wrong with the checkpoint files I got from training? Or am I missing something when I run the evaluation? Here is my evaluation code:
import tensorflow as tf
from tensorflow.contrib import slim

from datasets import flowers
from nets import inception

with tf.Graph().as_default():
    tf.logging.set_verbosity(tf.logging.INFO)
    tf_global_step = slim.get_or_create_global_step()
    dataset = flowers.get_split('validation', 'tmp/flowers')
    images, labels = load_batch(dataset)
    logits, endpoints = inception.inception_v1(images, num_classes=dataset.num_classes, is_training=False)
    predictions = tf.argmax(logits, 1)
    # Define the metrics:
    names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
        'eval/Accuracy': slim.metrics.streaming_accuracy(predictions, labels),
        'eval/Recall': slim.metrics.streaming_recall(predictions, labels)})
    print('Running evaluation Loop...')
    checkpoint_path = tf.train.latest_checkpoint('tmp/flowers/finetune_log')
    metric_values = slim.evaluation.evaluate_once(
        num_evals=20,
        master='',
        checkpoint_path=checkpoint_path,
        logdir='tmp/flowers/eval_finetune_log',
        eval_op=names_to_updates.values(),
        final_op=names_to_values.values())
Just in case, here is my training code:
import os

def get_init_fn():
    """Returns a function run by the chief worker to warm-start the training."""
    checkpoint_exclude_scopes = ["InceptionV1/Logits", "InceptionV1/AuxLogits"]
    exclusions = [scope.strip() for scope in checkpoint_exclude_scopes]
    variables_to_restore = []
    for var in slim.get_model_variables():
        excluded = False
        for exclusion in exclusions:
            if var.op.name.startswith(exclusion):
                excluded = True
                break
        if not excluded:
            variables_to_restore.append(var)
    return slim.assign_from_checkpoint_fn(
        os.path.join('tmp/checkpoints', 'inception_v1.ckpt'),
        variables_to_restore)

train_dir = 'tmp/flowers/finetune_log'
with tf.Graph().as_default():
    dataset = flowers.get_split('train', 'tmp/flowers')
    images, labels = load_batch(dataset)
    with slim.arg_scope(inception.inception_v1_arg_scope()):
        logits, _ = inception.inception_v1(images, num_classes=dataset.num_classes, is_training=True)
    one_hot_labels = slim.one_hot_encoding(labels, 5)
    slim.losses.softmax_cross_entropy(logits, one_hot_labels)
    total_loss = slim.losses.get_total_loss()
    tf.summary.scalar('losses/Total Loss', total_loss)
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
    train_op = slim.learning.create_train_op(total_loss, optimizer)
    final_loss = slim.learning.train(
        train_op,
        logdir=train_dir,
        init_fn=get_init_fn(),
        number_of_steps=5000,
        save_summaries_secs=1)
print('done.')
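Thanks a lot. This has blocked me for a long time.
Answer 0 (score: 0)
I found that the evaluation snippet runs if I make the following two changes:
1. Define slim.arg_scope() for the model. I think this is the cause of the NotFoundError: without it, the program does not know parameters such as the model's convolution kernel settings, so the code should be changed as follows: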
    images, labels = load_batch(dataset)
    with slim.arg_scope(inception.inception_v1_arg_scope()):
        logits, _ = inception.inception_v1(images, num_classes=dataset.num_classes, is_training=False)
    predictions = tf.argmax(logits, 1)
2. I removed slim.metrics.aggregate_metric_map() and used a single simple metric instead.
It runs now.
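For anyone wondering why the arg_scope matters: as far as I can tell, inception_v1_arg_scope() configures batch normalization for the conv layers, and with a normalizer set slim.conv2d creates BatchNorm variables instead of biases. A graph built without the scope therefore asks the checkpoint for ".../biases" keys that training never saved. Below is a minimal sketch of that mismatch in plain Python. The variable name sets are hypothetical and hard-coded for illustration, and missing_keys and unused_keys are helper names I made up, not part of TF-Slim.

```python
# Hypothetical variable names, for illustration only. In a real session they
# would come from the evaluation graph and from the checkpoint file.
SCOPE = "InceptionV1/Mixed_4c/Branch_0/Conv2d_0a_1x1"

# Assumed contents of a checkpoint written by a model that was built
# inside inception_v1_arg_scope() (batch norm, no biases).
checkpoint_keys = {
    SCOPE + "/weights",
    SCOPE + "/BatchNorm/beta",
    SCOPE + "/BatchNorm/moving_mean",
    SCOPE + "/BatchNorm/moving_variance",
}

# Assumed variables of an evaluation graph built without the arg_scope
# (plain conv layer with biases).
graph_variables = {
    SCOPE + "/weights",
    SCOPE + "/biases",
}


def missing_keys(graph_names, ckpt_names):
    """Return graph variable names that have no matching checkpoint key."""
    return sorted(set(graph_names) - set(ckpt_names))


def unused_keys(graph_names, ckpt_names):
    """Return checkpoint keys that no graph variable asks for."""
    return sorted(set(ckpt_names) - set(graph_names))


missing = missing_keys(graph_variables, checkpoint_keys)
unused = unused_keys(graph_variables, checkpoint_keys)

# Every name in `missing` would trigger a "Key ... not found in checkpoint" error.
print("missing from checkpoint:", missing)
print("in checkpoint but not in graph:", unused)
```

To run the same comparison on a real model in TF 1.x, the checkpoint keys can be read with tf.train.NewCheckpointReader(checkpoint_path).get_variable_to_shape_map() and the graph names taken from var.op.name for each var in slim.get_model_variables(). I have not verified this against the exact checkpoint in the question.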