Problem
Training a custom tf.estimator.Estimator with tf.data.Dataset on TensorFlow 1.11 runs much more slowly than the same model architecture in tf.keras with the data fed in directly. More precisely: it mostly runs fast (in terms of global_step/sec), but is very slow at the beginning and end of each epoch. During the "fast" batches, GPU utilization is around 30%; during the slow ones, it is around 1%.
Probable causes
- My input pipeline isn't well-optimized enough, so the GPU sits idle waiting on CPU data processing. But I don't understand why that happens only for the last few mini-batches before the epoch boundary.
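One mechanism that could produce a stall at epoch boundaries: `Dataset.shuffle(buffer_size)` must completely fill its buffer before it can yield anything, and the code below sets `buffer_size` to the full dataset size. This is a minimal pure-Python sketch of that buffer semantics (not TensorFlow's actual implementation):

```python
import random

def shuffle_stream(source, buffer_size, seed=0):
    """Mimic tf.data's Dataset.shuffle: fill a buffer of `buffer_size`
    elements, then repeatedly yield a random one and replace it with
    the next element from the source."""
    rng = random.Random(seed)
    it = iter(source)
    # The buffer must be filled before the first yield -- with
    # buffer_size == len(dataset) that means reading everything first.
    buffer = []
    for item in it:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            break
    for item in it:
        i = rng.randrange(len(buffer))
        yield buffer[i]
        buffer[i] = item
    # Source exhausted: drain the remaining buffered elements.
    rng.shuffle(buffer)
    while buffer:
        yield buffer.pop()

data = list(range(10))
shuffled = list(shuffle_stream(data, buffer_size=len(data)))
```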
What I have tried
I followed the steps in the Input Pipeline Performance Guide. That sped up the batches that are not near an epoch boundary, but I'm not sure how to improve things further.
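For context, the epoch length in steps follows from the sizes used in the minimal example below; assuming those numbers, one epoch is about 1007 batches, which roughly lines up with where the slow steps appear in the log (around steps 1000 and 1900):

```python
import math

TRAIN_ROWS = 1_030_255  # train_data.shape[0] in the example below
BATCH_SIZE = 2**10      # 1024, as in the example below

# Number of training steps that make up one pass over the data.
steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
print(steps_per_epoch)  # 1007
```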
Minimal working example
import numpy as np
import pandas as pd
import tensorflow as tf
train_data = pd.DataFrame(np.random.randn(1030255, 1021)).rename(columns={c:str(c) for c in range(1021)})
train_target = pd.DataFrame(np.round(np.random.rand(1030255, 15))).rename(columns={c:str(c) for c in range(15)})
val_data = pd.DataFrame(np.random.randn(491077, 1021)).rename(columns={c:str(c) for c in range(1021)})
val_target = pd.DataFrame(np.round(np.random.rand(491077, 15))).rename(columns={c:str(c) for c in range(15)})
def model_fn(features, labels, mode, params):
    net = tf.feature_column.input_layer(features, params['feature_columns'])
    net = tf.layers.dense(net, units=params['hidden_units'], activation=tf.nn.tanh)
    logits = tf.layers.dense(net, params['outputs'], activation=None)
    loss = tf.losses.sigmoid_cross_entropy(labels, logits=logits)
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss)
    elif mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer()
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
cols = [str(c) for c in np.sort(np.random.choice(1021, size=98, replace=False))]
target_cols = [str(c) for c in np.arange(11)]
VALIDATION_BATCH_SIZE = int(val_data.shape[0] / 4.0)
BATCH_SIZE = 2**10
def train_input_fn():
    features = {k: train_data[k].values for k in cols}
    dataset = tf.data.Dataset.from_tensor_slices((features, train_target[target_cols].values))
    dataset = dataset.repeat() \
        .shuffle(train_data.shape[0]) \
        .batch(BATCH_SIZE) \
        .prefetch(BATCH_SIZE)
    return dataset
def validation_input_fn():
    features = {k: val_data[k].values for k in cols}
    dataset = tf.data.Dataset.from_tensor_slices((features, val_target[target_cols].values))
    dataset = dataset.repeat() \
        .batch(VALIDATION_BATCH_SIZE) \
        .prefetch(VALIDATION_BATCH_SIZE)
    return dataset
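A quick sanity check on the eval sizing (using the numbers above): four eval steps at VALIDATION_BATCH_SIZE cover almost exactly one pass over the validation set, so each evaluation reads nearly all 491,077 rows:

```python
VAL_ROWS = 491_077
VALIDATION_BATCH_SIZE = int(VAL_ROWS / 4.0)  # as in the snippet above
EVAL_STEPS = 4  # matches steps=4 in the EvalSpec below

rows_covered = VALIDATION_BATCH_SIZE * EVAL_STEPS
print(VALIDATION_BATCH_SIZE, rows_covered)  # 122769 491076
```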
feature_columns = [tf.feature_column.numeric_column(f) for f in cols]
run_cfg = tf.estimator.RunConfig(tf_random_seed=1, save_checkpoints_steps=1000, save_checkpoints_secs=None)
classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    config=run_cfg,
    params={
        'feature_columns': feature_columns,
        'hidden_units': 128,
        'outputs': len(target_cols)
    })
tf.estimator.train_and_evaluate(
    classifier,
    train_spec=tf.estimator.TrainSpec(train_input_fn),
    eval_spec=tf.estimator.EvalSpec(validation_input_fn, steps=4, start_delay_secs=30, throttle_secs=30))
Sample output
(The time reported for the first step is misleadingly large, since TensorFlow measures it against the previous training step. That is not the case, however, for the last training step before the eval starts.)
INFO:tensorflow:Finished evaluation at 2018-10-31-08:17:35
INFO:tensorflow:Saving dict for global step 1000: global_step = 1000, loss = 0.694723
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: /tmp/tmpvQq9DV/model.ckpt-1000
INFO:tensorflow:global_step/sec: 0.930553
INFO:tensorflow:loss = 0.69407356, step = 1000 (107.463 sec)
INFO:tensorflow:global_step/sec: 171.673
INFO:tensorflow:loss = 0.69456786, step = 1100 (0.583 sec)
INFO:tensorflow:global_step/sec: 166.964
INFO:tensorflow:loss = 0.69411445, step = 1200 (0.599 sec)
INFO:tensorflow:global_step/sec: 172.226
INFO:tensorflow:loss = 0.6940959, step = 1300 (0.579 sec)
INFO:tensorflow:global_step/sec: 170.882
INFO:tensorflow:loss = 0.69440323, step = 1400 (0.586 sec)
INFO:tensorflow:global_step/sec: 173.453
INFO:tensorflow:loss = 0.69332886, step = 1500 (0.577 sec)
INFO:tensorflow:global_step/sec: 167.078
INFO:tensorflow:loss = 0.6950055, step = 1600 (0.598 sec)
INFO:tensorflow:global_step/sec: 159.763
INFO:tensorflow:loss = 0.69460225, step = 1700 (0.626 sec)
INFO:tensorflow:global_step/sec: 161.674
INFO:tensorflow:loss = 0.6940766, step = 1800 (0.617 sec)
INFO:tensorflow:global_step/sec: 8.83793
INFO:tensorflow:loss = 0.6936994, step = 1900 (11.315 sec)
INFO:tensorflow:Saving checkpoints for 2000 into /tmp/tmpvQq9DV/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-10-31-08:18:11
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpvQq9DV/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/4]
INFO:tensorflow:Evaluation [2/4]
INFO:tensorflow:Evaluation [3/4]
INFO:tensorflow:Evaluation [4/4]
INFO:tensorflow:Finished evaluation at 2018-10-31-08:19:36
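In examples per second, the gap between the fast and slow phases of the log above works out to roughly a 19x slowdown:

```python
BATCH_SIZE = 1024

fast = 171.673 * BATCH_SIZE  # ~175,793 examples/sec during fast batches
slow = 8.83793 * BATCH_SIZE  # ~9,050 examples/sec just before the eval
print(round(fast / slow, 1))  # 19.4
```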
Without eval step
I thought the eval step (which is itself very slow) might be the problem. But when I only train, it runs even more slowly:
classifier.train(input_fn=train_input_fn, steps=5000)
[...]
INFO:tensorflow:Saving checkpoints for 1000 into /tmp/tmpIJctNp/model.ckpt.
INFO:tensorflow:global_step/sec: 6.95414
INFO:tensorflow:loss = 0.6948092, step = 1000 (14.381 sec)
INFO:tensorflow:global_step/sec: 12.0491
INFO:tensorflow:loss = 0.6942879, step = 1100 (8.298 sec)
INFO:tensorflow:global_step/sec: 8.98388
INFO:tensorflow:loss = 0.6939402, step = 1200 (11.131 sec)
INFO:tensorflow:global_step/sec: 8.86219
INFO:tensorflow:loss = 0.6946343, step = 1300 (11.284 sec)
INFO:tensorflow:global_step/sec: 8.74248
INFO:tensorflow:loss = 0.694865, step = 1400 (11.439 sec)
Thanks in advance for any help.