Job on Cloud ML fails after successfully completing 1000 steps

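I went through the Cloud ML tutorial on the census data (cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction), and the job there completed successfully. However, when I followed this tutorial on the flowers image data (https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow), my training job appeared to succeed, judging by the logs showing that 1000 steps completed. After that point, though, the StackDriver logs show that the job failed. I tried reusing the command-line argument structure from the census walkthrough, deleting and re-creating the JOB_ID and the --output_path user argument, and using the STANDARD_1 scale tier, but none of it helped. I would appreciate any help from the community. Thanks!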

Here is the error, taken from the tail end of the log snapshot:

{ textPayload: "The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
    self.eval(session)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
    self.model.format_metric_values(self.evaluator.evaluate()))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 95, in evaluate
    return metric_values
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 234, in _run
    sess.run(enqueue_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
NotFoundError: Error executing an HTTP request (HTTP response code 404, error code 0, error message '') when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
  [[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]

Caused by op u'ReaderReadUpToV2', defined at:
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
    self.eval(session)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
    self.model.format_metric_values(self.evaluator.evaluate()))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in evaluate
    self.eval_batch_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 310, in build_eval_graph
    return self.build_graph(data_paths, batch_size, GraphMod.EVALUATE)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
    num_epochs=None if is_training else 2)
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 52, in read_examples
    filename_queue, batch_size)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 226, in read_up_to
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 380, in _reader_read_up_to_v2
    num_records=num_records, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Error executing an HTTP request (HTTP response code 404, error code 0, error message '') when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
  [[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]

To find out more about why your job exited please check the logs: console.cloud.google.com/logs/viewer?project=123456234&resource=ml_job%2Fjob_id%2Fflowers_User_20170524_145125&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22flowers_User_20170524_145125%22"
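
A payload this long is easier to act on once the failing path is isolated. The following is a minimal Python sketch, assuming the `textPayload` value has been copied into a string. The function name `missing_gcs_paths` and the regular expression are illustrative assumptions, not part of any Cloud ML tooling.

```python
import re

# Matches the Cloud Storage URI that follows "when reading" in the error text.
# The pattern is an assumption based on the wording of the log shown above.
GCS_READ_ERROR = re.compile(r"when reading (gs://\S+)")


def missing_gcs_paths(text_payload):
    """Return the distinct gs:// URIs named in 'when reading' errors, in order."""
    seen = []
    for uri in GCS_READ_ERROR.findall(text_payload):
        if uri not in seen:
            seen.append(uri)
    return seen
```

Applied to the payload above, this should yield the single `preproc/eval` path. That path carries the timestamp 20170522_121407, while the job ID in the log-viewer link is flowers_User_20170524_145125, so it may be worth confirming that the path still points at an existing preprocessing output.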

1 Answer

  • 0

    The error indicates a 404 (Not Found) when trying to read

    gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
    

    Does that file exist?
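
One way to answer that question is to list what Cloud Storage holds under the path. The sketch below is hedged: it assumes the `google-cloud-storage` package is installed and credentials are configured, and the helper names (`split_gcs_uri`, `list_object_names`, `describe_path`) are hypothetical. It distinguishes an exact object from a path that is only a prefix of other objects, since a reader given a bare prefix would also get a 404.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("gs://"):
        raise ValueError("not a Cloud Storage URI: %r" % uri)
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name in %r" % uri)
    return bucket, prefix


def list_object_names(bucket, prefix):
    """List object names under a prefix (needs google-cloud-storage and credentials)."""
    # Imported lazily so the pure helpers in this sketch work without the package.
    from google.cloud import storage

    client = storage.Client()
    return [blob.name for blob in client.list_blobs(bucket, prefix=prefix)]


def describe_path(uri, lister=list_object_names):
    """Report whether the URI is an object, only a prefix of objects, or absent."""
    bucket, prefix = split_gcs_uri(uri)
    names = lister(bucket, prefix)
    if prefix in names:
        return "object exists"
    if names:
        return "no exact object, but %d object(s) share this prefix" % len(names)
    return "nothing found"
```

If the result is "no exact object, but ... share this prefix", the data is there but the path given to the trainer probably needs to match the actual file names.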

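    Based on the name, I'm guessing this is the evaluation data. So my guess is that you only run evaluation once every 1000 steps, which is why the 1000 steps completed successfully. The job then tried to run evaluation and failed because the data does not exist.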
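
The reasoning above can be illustrated with a toy loop. This is a sketch under stated assumptions, not the tutorial's `trainer/task.py`: the names `run_training` and `MissingEvalData` and the evaluation interval are hypothetical, and the exception merely stands in for the `NotFoundError` in the log.

```python
class MissingEvalData(Exception):
    """Stand-in for the NotFoundError raised when eval files cannot be read."""


def run_training(max_steps, eval_interval, eval_data_exists):
    """Toy loop: train on every step, evaluate only every `eval_interval` steps."""
    completed = 0
    for step in range(1, max_steps + 1):
        completed = step  # a real trainer would run one optimisation step here
        if step % eval_interval == 0:
            # Evaluation is the first point at which the eval path is read.
            if not eval_data_exists:
                raise MissingEvalData(
                    "eval data unreadable after step %d" % completed
                )
    return completed
```

With `max_steps=1000` and `eval_interval=1000`, every training step runs before the missing data is touched, which matches a log that shows 1000 completed steps followed by a failure.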
