
How to feed a list of gradients, or (gradient, variable name) pairs, to my model

This is related to an earlier question: How to partition a single batch into many invocations to save memory, and also to How to train a big model with relatively large batch size on a single GPU using Tensorflow?; however, I still could not find an exact answer. For instance, the answer to another related question was not accepted and has no further comments.

I want to try to simulate a larger batch size while using only a single GPU. So I need to compute the gradients for each smaller batch, aggregate/average them over several of these smaller batches, and only then apply them.

(Basically, it is like synchronous distributed SGD, but on a single device/GPU, executed serially. Of course, the speed advantage of distributed SGD is lost, but a larger batch size by itself may give better convergence accuracy and allow a larger step size, as indicated by a few recent papers.)

To keep the memory requirement low, I should do standard SGD with small batches, update the gradients after every iteration, and only then call optimizer.apply_gradients() (where optimizer is one of the already implemented optimizers).

So, everything looks simple, but when I went to implement it, it turned out to be not so trivial after all.

For example, I would like to use one Graph, compute the gradients on each iteration and then, once several batches have been processed, sum the gradients and pass them to my model. But the list itself cannot be fed to the feed_dict argument of sess.run. Also, passing the gradients directly does not quite work either: I get TypeError: unhashable type: 'numpy.ndarray' (I think the reason is that I cannot pass in a numpy.ndarray, only a TensorFlow tensor). I could define a placeholder for the gradients, but for that I would need to build the model first (specify the trainable variables, etc.).
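To make the failure concrete, here is a tiny sketch of the kind of call that triggers it (all names below are made up just for illustration, not from my actual model):

    import numpy as np
    import tensorflow as tf

    x = tf.placeholder(tf.float32, [2])
    y = 2.0 * x

    with tf.Session() as sess:
        grad_np = np.zeros(2, dtype=np.float32)        # a gradient value from an earlier run
        # feed_dict keys must be graph tensors, so using the numpy array as a key fails:
        # sess.run(y, feed_dict={grad_np: grad_np})    # TypeError: unhashable type: 'numpy.ndarray'
        print(sess.run(y, feed_dict={x: grad_np}))     # feeding it as a value works fine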

All in all, please tell me there is a simpler way to implement this.

2 Answers

  • 2

    You need to pass the gradient values to apply_gradients. They can be placeholders, but it is probably easier to just use the usual compute_gradients/apply_gradients combination:

    # Some loss measure
    loss = ...
    optimizer = ...
    gradients = optimizer.compute_gradients(loss)
    # gradients is a list of pairs
    _, gradient_tensors = zip(*gradients)
    # Apply gradients as usual
    train_op = optimizer.apply_gradients(gradients)
    
    # On training
    # Compute some gradients
    gradient_values = session.run(gradient_tensors, feed_dict={...})
    # gradient_values is a sequence of numpy arrays with gradients
    
    # After averaging multiple evaluations of gradient_values apply them
    session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))
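    
    The averaging between the two session.run calls above is not spelled out; a minimal NumPy sketch of that step (batches_per_step and the feed_dict contents here are placeholders to be filled in, not names from the snippet above) could look like this:
    
    import numpy as np
    
    # run the gradient tensors once per small batch and collect the numpy results
    all_values = [session.run(gradient_tensors, feed_dict={...})
                  for _ in range(batches_per_step)]
    
    # element-wise average over the small batches, one array per gradient tensor
    gradient_values_average = [np.mean(np.stack(vals), axis=0)
                               for vals in zip(*all_values)]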
    

    If you want to compute the average of the gradients within TensorFlow, that requires some additional code specifically for it, maybe something like this:

    import numpy as np
    import tensorflow as tf
    
    # Some loss measure
    loss = ...
    optimizer = ...
    gradients = optimizer.compute_gradients(loss)
    # gradients is a list of pairs
    _, gradient_tensors = zip(*gradients)
    # Apply gradients as usual
    train_op = optimizer.apply_gradients(gradients)
    
    # Additional operations for gradient averaging
    gradient_placeholders = [tf.placeholder(t.dtype, [None] + t.shape.as_list())
                             for t in gradient_tensors]
    gradient_averages = [tf.reduce_mean(p, axis=0) for p in gradient_placeholders]
    
    # On training
    gradient_values = None
    # Compute some gradients
    for ...:  # Repeat for each small batch
        gradient_values_current = session.run(gradient_tensors, feed_dict={...})
        if gradient_values is None:
            gradient_values = [[g] for g in gradient_values_current]
        else:
            for g_list, g in zip(gradient_values, gradient_values_current):
                g_list.append(g)
    # Stack gradients
    gradient_values = [np.stack(g_list) for g_list in gradient_values]
    # Compute averages
    gradient_values_average = session.run(
        gradient_averages, feed_dict=dict(zip(gradient_placeholders, gradient_values)))
    
    # After averaging multiple gradients apply them
    session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))
    
  • 1

    There is no simpler way than the one you have already been told about. That way may seem complicated at first, but it is actually really simple. You just have to use the low-level API to manually compute the gradients for each batch, average over them, and then manually feed the averaged gradients to the optimizer to apply them.

    I will try to give some stripped-down code of how to do this. I will use dots as placeholders for the actual code, which depends on your problem. What you would usually do is something like this:

    import tensorflow as tf
    [...]
    input = tf.placeholder(...)
    [...]
    loss = ...
    [...]
    # initialize the optimizer
    optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
    # define operation to apply the gradients
    minimize = optimizer.minimize(loss)
    [...]
    if __name__ == '__main__':
        session = tf.Session(config=CONFIG)
        session.run(tf.global_variables_initializer())
        for step in range(1, MAX_STEPS + 1):
            data = ...
            loss_value = session.run([minimize, loss],
                                     feed_dict={input: data})[1]
    

    What you want to do now, averaging over multiple batches to save memory, would be something like this:

    import tensorflow as tf
    [...]
    input = tf.placeholder(...)
    [...]
    loss = ...
    [...]
    # initialize the optimizer
    optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
    
    # grab all trainable variables
    trainable_variables = tf.trainable_variables()
    
    # define variables to save the gradients in each batch
    accumulated_gradients = [tf.Variable(tf.zeros_like(tv.initialized_value()),
                                         trainable=False) for tv in
                             trainable_variables]
    
    # define operation to reset the accumulated gradients to zero
    reset_gradients = [gradient.assign(tf.zeros_like(gradient)) for gradient in
                       accumulated_gradients]
    
    # compute the gradients
    gradients = optimizer.compute_gradients(loss, trainable_variables)
    
    # Note: Gradients is a list of tuples containing the gradient and the
    # corresponding variable so gradient[0] is the actual gradient. Also divide
    # the gradients by BATCHES_PER_STEP so the learning rate still refers to
    # steps not batches.
    
    # define operation to evaluate a batch and accumulate the gradients
    evaluate_batch = [
        accumulated_gradient.assign_add(gradient[0]/BATCHES_PER_STEP)
        for accumulated_gradient, gradient in zip(accumulated_gradients,
                                                  gradients)]
    
    # define operation to apply the gradients
    apply_gradients = optimizer.apply_gradients([
        (accumulated_gradient, gradient[1]) for accumulated_gradient, gradient
        in zip(accumulated_gradients, gradients)])
    
    # define variable and operations to track the average batch loss
    average_loss = tf.Variable(0., trainable=False)
    update_loss = average_loss.assign_add(loss/BATCHES_PER_STEP)
    reset_loss = average_loss.assign(0.)
    [...]
    if __name__ == '__main__':
        session = tf.Session(config=CONFIG)
        session.run(tf.global_variables_initializer())
    
        data = ...  # a list of BATCHES_PER_STEP small batches for this step
        for batch_data in data:
            session.run([evaluate_batch, update_loss],
                        feed_dict={input: batch_data})
    
        # apply accumulated gradients
        session.run(apply_gradients)
    
        # get the averaged loss of this step
        loss_value = session.run(average_loss)
    
        # reset variables for next step
        session.run([reset_gradients, reset_loss])
    

    If you fill in the gaps, this should be runnable. However, I might have made a mistake while stripping it down and pasting it here. For a runnable example you can have a look at the project I am currently working on myself.

    I also want to point out explicitly that this is not the same as evaluating the loss on all of the batch data at once, because you are averaging the gradients. This is especially important when your loss does not work well with low statistics. Take a chi-square of a histogram, for example: computing chi-square gradients on histograms with a low number of entries per bin will not be as good as computing the gradient on just one histogram with all the bins filled at once.
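    To see that difference numerically, here is a toy NumPy sketch (all names are made up, and the predicted bin contents themselves are treated as the quantity we differentiate with respect to): the average of the gradients of two half-statistics histograms is not the same as the gradient computed on the combined histogram.

    import numpy as np

    def chi2_grad(counts, expected):
        # analytic per-bin derivative of (counts - expected)**2 / expected w.r.t. expected
        return 1.0 - (counts / expected) ** 2

    rng = np.random.default_rng(0)
    expected_full = np.full(5, 10.0)                   # predicted bin contents for a full step
    half_counts = [rng.poisson(expected_full / 2) for _ in range(2)]

    # gradient on the combined histogram (all bins filled at once)
    grad_full = chi2_grad(half_counts[0] + half_counts[1], expected_full)

    # average of the gradients of the two half-statistics histograms
    grad_avg = np.mean([chi2_grad(c, expected_full / 2) for c in half_counts], axis=0)

    print(grad_full)   # differs from grad_avg whenever the two halves differ
    print(grad_avg)

    Whenever the two half-histograms differ, the two printed gradients differ as well, which is exactly the low-statistics effect described above.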
