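To obtain the best checkpoint after the fitting phase of my model with the Estimator API, I ended up using a tf.train.SessionRunHook in which I reference the tf.train.Saver instance created by the estimator, so that I can manually save the best checkpoint found so far during fitting to a separate directory. My intention is to let the estimator instance follow its own checkpointing procedure, but whenever I find a better set of weights, to also checkpoint the model to a specific directory.
Here is a code snippet to illustrate my idea: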

from os.path import join

import tensorflow as tf

class SessRunHOOK(tf.train.SessionRunHook):
     ...
     def end(self, session):
         ...
         # After each call to estimator.evaluate(), if the metrics
         # for this model are the best so far, save the checkpoint
         # to a specific directory like so.
         # 'savers' is the graph collection in which the estimator's Saver is registered.
         saver = session.graph.get_collection('savers')[0]
         saver.save(session,
                    join(best_ckpt_dir, 'ckpt'),
                    global_step = global_step,
                    latest_filename = None,
                    meta_graph_suffix = 'meta',
                    write_meta_graph = True,
                    write_state = True,
                    strip_default_attrs = False)
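
For comparison, the "keep the best so far" bookkeeping described above can be sketched without touching the Saver at all, by copying the files of a checkpoint the estimator has already written. This is a different technique from the hook in the snippet, not a fix for it. The class name BestCheckpointCopier, the higher_is_better flag and the fake checkpoint files in demo() are all hypothetical and exist only for illustration.

```python
import glob
import os
import shutil
import tempfile


class BestCheckpointCopier:
    """Tracks the best metric seen so far and copies the matching checkpoint files.

    Hypothetical helper for illustration; it is not part of TensorFlow.
    """

    def __init__(self, best_dir, higher_is_better=True):
        self.best_dir = best_dir
        self.higher_is_better = higher_is_better
        self.best_value = None

    def _improved(self, value):
        if self.best_value is None:
            return True
        if self.higher_is_better:
            return value > self.best_value
        return value < self.best_value

    def update(self, value, checkpoint_prefix):
        """Copies every file named `<checkpoint_prefix>.*` if `value` is a new best."""
        if not self._improved(value):
            return False
        self.best_value = value
        # Drop the previous best so that only one checkpoint is kept.
        shutil.rmtree(self.best_dir, ignore_errors=True)
        os.makedirs(self.best_dir)
        for path in glob.glob(glob.escape(checkpoint_prefix) + ".*"):
            shutil.copy2(path, self.best_dir)
        return True


def demo():
    """Simulates three evaluations using empty stand-in checkpoint files."""
    with tempfile.TemporaryDirectory() as root:
        model_dir = os.path.join(root, "ckpts")
        best_dir = os.path.join(root, "best")
        os.makedirs(model_dir)
        copier = BestCheckpointCopier(best_dir)
        results = []
        for step, accuracy in [(100, 0.70), (200, 0.65), (300, 0.80)]:
            prefix = os.path.join(model_dir, "model.ckpt-%d" % step)
            for suffix in (".index", ".meta", ".data-00000-of-00001"):
                open(prefix + suffix, "w").close()
            results.append(copier.update(accuracy, prefix))
        return results, sorted(os.listdir(best_dir))
```

With an Estimator, update() could presumably be called after each estimator.evaluate() with one value from the returned metrics dict and the prefix returned by estimator.latest_checkpoint(); which metric key to use depends on the model function. Since no checkpoint state file is copied, restoring from the copy would mean pointing at the checkpoint prefix directly.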

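In the estimator's model function, the hook instance is passed to the EstimatorSpec's training and evaluation hook arguments (training_hooks and evaluation_hooks).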

The Estimator instance is created and used as follows:

RUN_CONFIG = tf.estimator.RunConfig(
    save_summary_steps     = None,
    log_step_count_steps   = None,
    keep_checkpoint_max    = 3,
    save_checkpoints_steps = TRAIN_STEPS,
    session_config         = None
)
estimator = tf.estimator.Estimator(
    model_fn  = model_fn,
    params    = ESTIMATOR_PARAMS,
    model_dir = DUMP_DIR,
    config    = RUN_CONFIG
)

while True:
    # Estimator.train() takes `steps` (or `max_steps`); it has no `train_steps` argument.
    estimator.train(train_input_fn, steps = TRAIN_STEPS)
    estimator.evaluate(eval_input_fn)

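What ends up happening is that, even though the estimator instance keeps at most 3 checkpoints in the model_dir specified in the run configuration, temporary meta files pile up there. Here is a screenshot of the model_dir with those temporary files: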

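The only way to get rid of them is to restart the system; refreshing the file browser GUI makes no difference. This leads me to believe there is something wrong with the way I am checkpointing, and I would appreciate it if someone could point it out.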

Eventually training exhausts the system's resources and fails with "Too many open files", yielding the following stack trace:

2018-10-21 04:07:10.571642: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Resource exhausted: /home/m232/catalinh/models/orange_juice/dtt/keurig_new_logo_160x160x1_with_negatives/ckpts/model.ckpt-19494.data-00000-of-00001; Too many open files
Traceback (most recent call last):
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: /home/m232/catalinh/models/orange_juice/dtt/keurig_new_logo_160x160x1_with_negatives/ckpts/model.ckpt-19494.data-00000-of-00001; Too many open files
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: save/RestoreV2/_149 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_154_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/m232/.virtualenvs/ml/bin/tensor_dyve", line 11, in <module>
    sys.exit(main())
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/TensorDyve/tensor_dyve.py", line 352, in main
    estimator.train(input_fn = train_input_pipe, steps = TRAIN_STEPS)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 376, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1145, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1173, in _train_model_default
    saving_listeners)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1448, in _train_with_estimator_spec
    log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 421, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 832, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 555, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1018, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1023, in _create_session
    return self._sess_creator.create_session()
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 712, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 483, in create_session
    init_fn=self._scaffold.init_fn)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
    config=config)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 211, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1725, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
    run_metadata_ptr)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
    run_metadata)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: /home/m232/catalinh/models/orange_juice/dtt/keurig_new_logo_160x160x1_with_negatives/ckpts/model.ckpt-19494.data-00000-of-00001; Too many open files
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: save/RestoreV2/_149 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_154_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'save/RestoreV2', defined at:
  File "/home/m232/.virtualenvs/ml/bin/tensor_dyve", line 11, in <module>
    sys.exit(main())
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/TensorDyve/tensor_dyve.py", line 352, in main
    estimator.train(input_fn = train_input_pipe, steps = TRAIN_STEPS)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 376, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1145, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1173, in _train_model_default
    saving_listeners)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1448, in _train_with_estimator_spec
    log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 421, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 832, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 555, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1018, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1023, in _create_session
    return self._sess_creator.create_session()
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 712, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 474, in create_session
    self._scaffold.finalize()
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 214, in finalize
    self._saver.build()
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1293, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1330, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 772, in _build_internal
    restore_sequentially, reshape)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 450, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 397, in _AddRestoreOps
    restore_sequentially)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 829, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1463, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File _2852 523, line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/home/m232/.virtualenvs/ml/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): /home/m232/catalinh/models/orange_juice/dtt/keurig_new_logo_160x160x1_with_negatives/ckpts/model.ckpt-19494.data-00000-of-00001; Too many open files
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: save/RestoreV2/_149 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_154_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
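
The "Too many open files" message in the trace above is what the operating system reports when a process reaches its file-descriptor limit, so one way to narrow this down might be to watch the descriptor count of the training process between iterations. The following is a minimal, Linux-only diagnostic sketch using only the standard library; the helper names open_fd_report and open_paths_under are hypothetical, and it only observes the symptom without establishing what is leaking.

```python
import os
import resource


def open_fd_report():
    """Returns (open_fd_count, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /proc/self/fd has one entry per open descriptor (Linux only). Listing the
    # directory briefly opens one descriptor itself, so the count can be off by one.
    open_count = len(os.listdir("/proc/self/fd"))
    return open_count, soft, hard


def open_paths_under(directory):
    """Returns the targets of open descriptors that point inside `directory`."""
    directory = os.path.realpath(directory)
    paths = []
    for fd in os.listdir("/proc/self/fd"):
        try:
            target = os.readlink(os.path.join("/proc/self/fd", fd))
        except OSError:
            continue  # the descriptor was closed while we were iterating
        if target.startswith(directory + os.sep):
            paths.append(target)
    return sorted(paths)
```

Calling open_fd_report() after each estimator.train() / estimator.evaluate() pair would show whether the count keeps growing toward the soft limit, and open_paths_under(DUMP_DIR) would show whether the descriptors that stay open belong to files in the model directory.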