How can I control when evaluation versus training runs with TensorFlow's Estimator API?

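The TensorFlow documentation does not provide any example of how to periodically evaluate a model on an evaluation set.

The accepted answer suggests using Experiment (which is deprecated according to this README).

Everything I found online points to the train_and_evaluate method. However, I still do not see how to switch between the two processes (training and evaluation). I tried the following: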

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    params=hparams,
    model_dir=model_dir,
    config = tf.estimator.RunConfig(
        save_checkpoints_steps = 2000,
        save_summary_steps = 100,
        keep_checkpoint_max=5
    )
)

train_input_fn = lambda: input_fn(
    train_file, #a .tfrecords file
    train=True,
    batch_size=70,
    num_epochs=100
)

eval_input_fn = lambda: input_fn(
    val_file, # another .tfrecords file
    train=False,
    batch_size=70,
    num_epochs=1
)
train_spec = tf.estimator.TrainSpec(
    train_input_fn,
    max_steps=125
)    

eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    steps=30,
    name='validation',
    start_delay_secs=150,
    throttle_secs=200
)

tf.logging.info("start experiment...")
tf.estimator.train_and_evaluate(
    estimator,
    train_spec,
    eval_spec
)

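Here is what I think my code should do:

Train the model for 100 epochs with a batch size of 70; save a checkpoint every 2000 batches; save summaries every 100 batches; keep at most 5 checkpoints; after 150 batches on the training set, compute the validation error using 30 batches of validation data.

However, I get the following logs: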

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /output/model.ckpt.
INFO:tensorflow:loss = 39.55082, step = 1
INFO:tensorflow:global_step/sec: 178.622
INFO:tensorflow:loss = 1.0455043, step = 101 (0.560 sec)
INFO:tensorflow:Saving checkpoints for 150 into /output/model.ckpt.
INFO:tensorflow:Loss for final step: 0.8327793.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /projects/MNIST-GCP/output/model.ckpt-150
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [3/30]
INFO:tensorflow:Evaluation [6/30]
INFO:tensorflow:Evaluation [9/30]
INFO:tensorflow:Evaluation [12/30]
INFO:tensorflow:Evaluation [15/30]
INFO:tensorflow:Evaluation [18/30]
INFO:tensorflow:Evaluation [21/30]
INFO:tensorflow:Evaluation [24/30]
INFO:tensorflow:Evaluation [27/30]
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Finished evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Saving dict for global step 150: accuracy = 0.8552381, global_step =150, loss = 0.95031387

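From the logs, it seems that training stops after the first evaluation step. What am I missing in the documentation? Could you explain how I should implement what I think my code is doing?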

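Additional information: I am running everything on the MNIST dataset, which has 50,000 images in the training set, so (I think) the model should run for num_epochs * 50,000 / batch_size ≃ 7,000 steps.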
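As a quick sanity check on that estimate, the arithmetic can be worked out directly. This is a minimal sketch that assumes input_fn repeats the data for num_epochs and then batches it, so only the final batch can be partial; the figures are the ones quoted in the question.

```python
import math

num_examples = 50000   # MNIST training images, as stated in the question
batch_size = 70
num_epochs = 100

# Assumption: the dataset is repeated first and batched afterwards,
# so only the very last batch can be smaller than batch_size.
steps_per_epoch = num_examples / batch_size
total_steps = math.ceil(num_examples * num_epochs / batch_size)

print(round(steps_per_epoch, 1))  # about 714.3 steps per epoch
print(total_steps)                # 71429
```

Under these assumptions 100 epochs come to roughly 71,000 steps, and a figure near 7,000 corresponds to about 10 epochs. Either way, max_steps=125 stops training well before a single epoch has been consumed.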

I sincerely appreciate your help!

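EDIT: After running experiments, I realized that max_steps controls the number of steps of the whole training procedure, not just the number of steps before computing metrics on the test set. Reading tf.estimator.Estimator.train, I see that it has a steps argument, which works incrementally and is bounded by max_steps; however, tf.estimator.TrainSpec does not have a steps argument, which means I cannot control the number of steps to take before computing metrics on the validation set.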
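One way to get step-based control, given that TrainSpec has no steps argument, is to skip train_and_evaluate and alternate Estimator.train(steps=...) with Estimator.evaluate(...) in a plain loop. The sketch below is an illustration under stated assumptions rather than a recommended library pattern: train_with_periodic_eval, eval_every and _FakeEstimator are hypothetical names, and the stand-in class exists only so the loop can run without TensorFlow. A real tf.estimator.Estimator accepts the same train and evaluate keyword arguments used here.

```python
def train_with_periodic_eval(estimator, train_input_fn, eval_input_fn,
                             total_steps, eval_every, eval_steps=None):
    """Alternate training and evaluation every `eval_every` steps."""
    history = []
    done = 0
    while done < total_steps:
        chunk = min(eval_every, total_steps - done)
        # `steps` is incremental: each call trains `chunk` more steps,
        # resuming from the latest checkpoint in model_dir.
        estimator.train(input_fn=train_input_fn, steps=chunk)
        done += chunk
        metrics = estimator.evaluate(input_fn=eval_input_fn,
                                     steps=eval_steps, name='validation')
        history.append((done, metrics))
    return history


class _FakeEstimator(object):
    """Hypothetical stand-in that only counts steps (no TensorFlow needed)."""

    def __init__(self):
        self.global_step = 0

    def train(self, input_fn, steps=None):
        self.global_step += steps
        return self

    def evaluate(self, input_fn, steps=None, name=None):
        return {'global_step': self.global_step}


history = train_with_periodic_eval(
    _FakeEstimator(), train_input_fn=None, eval_input_fn=None,
    total_steps=400, eval_every=150, eval_steps=30)
print([step for step, _ in history])  # [150, 300, 400]
```

Each train call rebuilds the graph and restores the latest checkpoint, so very small eval_every values add noticeable overhead. The local counter also assumes the training input function never runs out of data before a chunk completes.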

2 Answers

In fact, the estimator switches from the training phase to the evaluation phase every 200 seconds or when training finishes.

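However, your code shows that you were able to complete 125 steps before evaluation started, which means your training had already finished. max_steps is the total number of training steps to run before stopping; it has no connection to the number of epochs (since that is not used by tf.estimator.train_and_evaluate). During training, your evaluation metrics will appear every throttle_secs (= 200 here).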
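To keep train_and_evaluate and still see several evaluations during one run, the settings from the question can be adjusted along these lines. The helper below is a hypothetical sketch (plan_schedule and its argument names are not part of TensorFlow); it only assembles keyword arguments so that it runs on its own. The values 500 and 10 are arbitrary illustrations, 7000 is the step estimate from the question, and the exact scheduling behavior differs between TensorFlow 1.x releases, so treat the comments as assumptions to verify against your version.

```python
def plan_schedule(total_steps, checkpoint_every_steps, eval_gap_secs,
                  eval_batches=None):
    """Build kwargs for RunConfig, TrainSpec and EvalSpec (hypothetical helper)."""
    if total_steps <= 0 or checkpoint_every_steps <= 0:
        raise ValueError('step counts must be positive')
    if eval_gap_secs < 0:
        raise ValueError('eval_gap_secs must be >= 0')
    run_config_kwargs = {
        # Evaluation restores a saved checkpoint, so checkpoints must be
        # written at least as often as evaluations are wanted.
        'save_checkpoints_steps': checkpoint_every_steps,
        'save_summary_steps': 100,
        'keep_checkpoint_max': 5,
    }
    train_spec_kwargs = {
        # max_steps bounds the WHOLE training run, not one train/eval cycle.
        'max_steps': total_steps,
    }
    eval_spec_kwargs = {
        'steps': eval_batches,   # None: evaluate until the input is exhausted
        'name': 'validation',
        'start_delay_secs': 0,
        # Time-based spacing between evaluations, as described in the answer.
        'throttle_secs': eval_gap_secs,
    }
    return run_config_kwargs, train_spec_kwargs, eval_spec_kwargs


run_kw, train_kw, eval_kw = plan_schedule(
    total_steps=7000, checkpoint_every_steps=500, eval_gap_secs=10,
    eval_batches=30)
print(train_kw['max_steps'], run_kw['save_checkpoints_steps'])  # 7000 500

# With TensorFlow 1.x installed, these would be passed on as, for example:
#   config = tf.estimator.RunConfig(**run_kw)
#   train_spec = tf.estimator.TrainSpec(train_input_fn, **train_kw)
#   eval_spec = tf.estimator.EvalSpec(eval_input_fn, **eval_kw)
```

The point of the sketch is that max_steps has to cover the full run, while the gap between evaluations is governed by time (throttle_secs) and by how often checkpoints are written.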

As for the metrics, you can add the following to your model:

predict = tf.nn.softmax(logits, name="softmax_tensor")
classes = tf.cast(tf.argmax(predict, 1), tf.uint8)

def conv_model_eval_metrics(classes, labels, mode):
    # Metrics need labels, which are only available in TRAIN and EVAL modes.
    if mode == tf.estimator.ModeKeys.TRAIN or mode == tf.estimator.ModeKeys.EVAL:
        # tf.metrics.* take labels first and predictions second; keyword
        # arguments avoid silently exchanging precision and recall.
        return {
            'accuracy': tf.metrics.accuracy(labels=labels, predictions=classes),
            'precision': tf.metrics.precision(labels=labels, predictions=classes),
            'recall': tf.metrics.recall(labels=labels, predictions=classes),
        }
    else:
        return None

eval_metrics = conv_model_eval_metrics(classes, labels, mode)
# eval_metrics is None in PREDICT mode, so only add summaries when it exists.
if eval_metrics is not None:
    with tf.variable_scope("performance_metrics"):
        # Accuracy: ratio of correctly predicted observations to all
        # observations.
        tf.summary.scalar('accuracy', eval_metrics['accuracy'][1])

        # Precision (how many selected items are relevant): ratio of correctly
        # predicted positive observations to all predicted positive observations.
        tf.summary.scalar('precision', eval_metrics['precision'][1])

        # Recall (how many relevant items are selected): ratio of correctly
        # predicted positive observations to all observations in the actual class.
        tf.summary.scalar('recall', eval_metrics['recall'][1])

This works very well for following precision, recall and accuracy in TensorBoard during training and evaluation.
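The comments in the snippet above describe the three metrics in words; the small pure-Python sketch below spells out the same definitions for a binary problem so the numbers can be checked by hand. It is an illustration only, not how tf.metrics is implemented, and binary_metrics is a hypothetical helper name. It also shows why argument order matters: tf.metrics.precision and tf.metrics.recall take labels first and predictions second, and swapping the two exchanges precision and recall.

```python
def binary_metrics(labels, predictions):
    """Return (accuracy, precision, recall) for 0/1 labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 1, 1, 0, 0]

# Correct order: accuracy 0.5, precision 0.4, recall about 0.667.
print(binary_metrics(labels, predictions))
# Swapped order: accuracy unchanged, precision and recall trade places.
print(binary_metrics(predictions, labels))
```

Note that tf.metrics.precision and tf.metrics.recall cast their inputs to boolean, so for a 10-class problem such as MNIST they are only meaningful on per-class (one-vs-rest) inputs; accuracy does not have this restriction.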

PS: Sorry, this is my first answer, which is why it is so painful to read ^^

Answered on 2024-04-29T16:09:02+08:00

1

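Repetition can be controlled with tf.data.Dataset.repeat(num_epochs) inside input_fn(). The training function runs until the number of epochs has been consumed, then the evaluation function runs, then the training function runs again for that number of epochs, and so on; finally, train_and_evaluate stops once the max_steps defined in TrainSpec is reached.

This is my conclusion from a few experiments; corrections are welcome.

Answered on 2024-04-29T16:09:02+08:00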
