I am trying to train a CNN with TensorFlow 1.8.0. It is a fairly large network, so I batch the input to run training on the GPU. But even with a batch size that should be valid, TensorFlow fails with an OOM error during training. When I shrink the training dataset while keeping the batch size at the same value, TensorFlow starts the training process successfully.

What I don't understand is why TensorFlow reports OOM when the number of training samples grows while the batch size stays constant. Batching is supposed to cap the number of samples processed per iteration precisely so that each step's data fits in GPU memory.
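To make that expectation concrete, here is a minimal, self-contained sketch of my mental model (toy shapes, not my real network): each run() call should only materialize one batch, regardless of how many samples the dataset holds.

import numpy as np
import tensorflow as tf  # 1.8.0

# Toy illustration (hypothetical sizes): the dataset holds 100k samples,
# but each sess.run() should only materialize batch_size of them.
data = np.random.rand(100000, 64).astype(np.float32)
dataset = tf.data.Dataset.from_tensor_slices(data).batch(32)
next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    print(sess.run(next_batch).shape)  # (32, 64): one batch per step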

Note: the whole dataset is kept in memory, and all samples fit in host memory without a problem.

Training graph creation

# First input dimension is None so the graph adapts to the batch size
x = tf.placeholder(dtype=training_data.dtype, shape=(None,) + training_data.shape[1:])
# First input dimension is None so the graph adapts to the batch size
y_ = tf.placeholder(dtype=training_label.dtype, shape=(None,) + training_label.shape[1:])

dataset = tf.data.Dataset.from_tensor_slices((x, y_)).shuffle(buffer_size=1000) \
    .repeat(num_epochs).batch(batch_size)
dataset_iter = dataset.make_initializable_iterator()

images, labels = dataset_iter.get_next()

######## A training model is created here using TF.Slim on GPU ########
tensor_outputs = create_model(images)
#######################################################################

cross_entropy_l = loss_func(tensor_outputs, labels)
cross_entropy = tf.reduce_mean(cross_entropy_l)
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = create_train_op(cross_entropy, optimizer, global_step=global_step)
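One workaround I am considering, though I am not sure it addresses the root cause: pin the input pipeline to the CPU so that only the batches produced by get_next() are ever copied to the GPU. A sketch reusing the names from above (x, y_, num_epochs, batch_size):

# Sketch of a possible workaround (untested assumption): keep the
# dataset/iterator buffers in host memory; only each batch returned by
# get_next() is transferred to the GPU.
with tf.device('/cpu:0'):
    dataset = tf.data.Dataset.from_tensor_slices((x, y_)) \
        .shuffle(buffer_size=1000).repeat(num_epochs).batch(batch_size)
    dataset_iter = dataset.make_initializable_iterator()
    images, labels = dataset_iter.get_next()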

Dataset initialization

hooks = [InitializerHook(training_nn_params), StopAtStepHook(required_steps)]
with tf.train.MonitoredTrainingSession(hooks=hooks) as monitored_sess:
    while not monitored_sess.should_stop():
        monitored_sess.run([train_step])
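To see what is actually allocated per step, I can also trace a run with the standard RunOptions/RunMetadata API (diagnostic only; hooks and train_step as above):

# Diagnostic sketch: trace one training step so that
# run_metadata.step_stats lists per-op memory usage and placement.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
with tf.train.MonitoredTrainingSession(hooks=hooks) as monitored_sess:
    monitored_sess.run([train_step], options=run_options,
                       run_metadata=run_metadata)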

The initializer hook

class InitializerHook(tf.train.SessionRunHook):

    def __init__(self, training_nn_params):
        self.training_nn_params = training_nn_params

    def after_create_session(self, session, coord):
        # Initialize the iterator once, feeding the full training arrays
        session.run(self.training_nn_params.input_iterator.initializer,
                    feed_dict={self.training_nn_params.x: self.training_nn_params.data,
                               self.training_nn_params.y_: self.training_nn_params.labels})

Note: training_nn_params.data and training_nn_params.labels contain the entire training set.
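For context, this is roughly how everything is wired together; NNParams here is a hypothetical stand-in for my own container class, using only the field names that appear in the hook above:

# Hypothetical wiring sketch; NNParams is just a plain container.
class NNParams(object):
    pass

training_nn_params = NNParams()
training_nn_params.x, training_nn_params.y_ = x, y_
training_nn_params.input_iterator = dataset_iter
training_nn_params.data = training_data
training_nn_params.labels = training_label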

I suspect that TensorFlow sizes part of the graph not by the batch dimension but by the size of the entire fed dataset, though I am not sure. How can I solve this problem?
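To verify that suspicion, my plan is to enable device-placement logging and check whether the dataset/iterator ops end up on the GPU (log_device_placement is the standard ConfigProto option; hooks and train_step as above):

# Diagnostic sketch: print the device each op is assigned to, so I can
# see whether the dataset/iterator buffers were placed on the GPU.
config = tf.ConfigProto(log_device_placement=True)
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as monitored_sess:
    while not monitored_sess.should_stop():
        monitored_sess.run([train_step])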