Job fails after successfully completing 1000 steps

Time: 2017-05-24 23:20:02

Tags: google-app-engine tensorflow google-cloud-platform tensorflow-serving google-cloud-ml-engine

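I went through the Cloud ML tutorial on the census data: cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction, and the job there completed successfully. However, when I went through the tutorial on the flower image data: https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow my training job appears to succeed, judging by the logs, which show 1000 steps completed. After that point, however, this snapshot of the StackDriver logs says the job failed. I tried reusing the command-line argument structure from the census walkthrough, deleting and recreating JOB_ID and the --output_path user argument, and using the STANDARD_1 scale tier, all to no avail. I would appreciate any help from the community. Thanks!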

Here is the error, which you can see at the tail end of the log snapshot:

{ textPayload: "The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
    self.eval(session)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
    self.model.format_metric_values(self.evaluator.evaluate()))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 95, in evaluate
    return metric_values
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 234, in _run
    sess.run(enqueue_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
NotFoundError: Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
     when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
     [[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]

Caused by op u'ReaderReadUpToV2', defined at:
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
    self.eval(session)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
    self.model.format_metric_values(self.evaluator.evaluate()))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in evaluate
    self.eval_batch_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 310, in build_eval_graph
    return self.build_graph(data_paths, batch_size, GraphMod.EVALUATE)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
    num_epochs=None if is_training else 2)
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 52, in read_examples
    filename_queue, batch_size)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 226, in read_up_to
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 380, in _reader_read_up_to_v2
    num_records=num_records, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
     when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
     [[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]

To find out more about why your job exited please check the logs: console.cloud.google.com/logs/viewer?project=123456234&resource=ml_job%2Fjob_id%2Fflowers_User_20170524_145125&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22flowers_User_20170524_145125%22"

1 Answer:

Answer 0 (score: 0)

The error says a 404 Not Found was returned when trying to read

gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval

Does that file exist?
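One way to answer that question is to list the objects under the path from the error message. Note also that the path in the traceback contains `flowers_User_20170522_121407`, while the job ID in the log URL is `flowers_User_20170524_145125`, so the evaluation path may be pointing at an older preprocessing run. The sketch below is only an illustration: `split_gcs_uri`, `gcs_prefix_exists` and `list_with_client` are hypothetical helper names, and `list_with_client` assumes the `google-cloud-storage` package is installed and credentials are configured.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/some/path' into ('bucket', 'some/path')."""
    if not uri.startswith("gs://"):
        raise ValueError("not a gs:// URI: %r" % uri)
    bucket, _, path = uri[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name in %r" % uri)
    return bucket, path


def gcs_prefix_exists(uri, lister):
    """Return True if at least one object name starts with the URI's path.

    `lister(bucket, prefix)` must return an iterable of object names. It is
    injected so the check can be exercised without network access.
    """
    bucket, prefix = split_gcs_uri(uri)
    return any(True for _ in lister(bucket, prefix))


def list_with_client(bucket, prefix):
    """Lister backed by google-cloud-storage (assumed to be installed)."""
    from google.cloud import storage  # imported lazily: optional dependency

    client = storage.Client()
    blobs = client.list_blobs(bucket, prefix=prefix, max_results=1)
    return (blob.name for blob in blobs)


if __name__ == "__main__":
    # Offline demonstration with a fake bucket that holds only training shards.
    fake_objects = ["User/run/preproc/train-00000-of-00001.tfrecord.gz"]

    def fake_lister(bucket, prefix):
        return [name for name in fake_objects if name.startswith(prefix)]

    print(gcs_prefix_exists("gs://my-bucket/User/run/preproc/train", fake_lister))
    print(gcs_prefix_exists("gs://my-bucket/User/run/preproc/eval", fake_lister))
```

Against the real bucket, `gcs_prefix_exists(uri, list_with_client)` would perform the same check; running `gsutil ls` on the path is an equivalent manual test.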

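Based on the name, I'm guessing it is the evaluation data. So my guess is that you only run evaluation once every 1000 steps, which is why the 1000 steps completed successfully. The job then tried to run evaluation and failed because the data doesn't exist.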
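The behaviour described above can be reproduced with a toy loop. This is not the tutorial's `trainer/task.py`, only a minimal sketch under the assumption that evaluation is triggered every `eval_interval` steps; all names here are hypothetical.

```python
class EvalDataNotFound(Exception):
    """Stand-in for the NotFoundError raised when eval files are missing."""


def evaluate(eval_files):
    # Evaluation is the first point at which the eval input is actually read.
    if not eval_files:
        raise EvalDataNotFound("no evaluation files matched")
    return len(eval_files)


def run_training(max_steps, eval_interval, eval_files):
    """Return (steps_completed, status) for a simulated training job."""
    completed = 0
    for step in range(1, max_steps + 1):
        completed = step  # stand-in for one training step
        if step % eval_interval == 0:
            try:
                evaluate(eval_files)
            except EvalDataNotFound as err:
                return completed, "failed at evaluation: %s" % err
    return completed, "succeeded"


if __name__ == "__main__":
    # All 1000 training steps finish, yet the job still ends in failure.
    print(run_training(1000, 1000, []))
    print(run_training(1000, 1000, ["eval-00000-of-00001"]))
```

Because the missing input is only touched at the first evaluation, the training log looks healthy right up to the step where evaluation starts.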