TensorFlow is 170 times slower than Theano for an RNN implementation

I am trying to implement an RNN in TensorFlow (0.11), based on this paper.

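They have a Theano implementation here, which I am comparing my implementation against. When I run their Theano implementation, it completes 10 epochs in about 1 hour. My TensorFlow implementation needs about 17 hours to complete 1 epoch. I am wondering whether someone could look at my code and tell me if there are any obvious problems that slow it down.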

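The purpose of the RNN is to predict the next item a user will click, based on the user's previous clicks. Items are represented by unique IDs, which are fed to the RNN as 1-HOT vectors.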
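To make the input encoding concrete, here is a minimal pure-Python sketch of the 1-HOT representation and of how a click session turns into (input, target) pairs. The helper names `one_hot` and `next_click_pairs` are hypothetical, and the sketch assumes item IDs have already been mapped to contiguous indices in the range 0 to n_items - 1.

```python
def one_hot(item_index, n_items):
    """Return a 1-HOT list of length n_items with a 1.0 at item_index."""
    vec = [0.0] * n_items
    vec[item_index] = 1.0
    return vec


def next_click_pairs(clicks):
    """Pair each click with the click that follows it in the same session."""
    return list(zip(clicks[:-1], clicks[1:]))


# A toy session over 5 items (the real model uses 37803 items).
session = [3, 1, 4]
pairs = next_click_pairs(session)
encoded = [(one_hot(x, 5), one_hot(y, 5)) for x, y in pairs]
print(pairs)
```

Each input vector has a single non-zero entry, so for 37803 items almost all of the dense representation is zeros.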

So the RNN is built like this:

[INPUT (1-HOT representation, size 37803)] -> [GRU layer (size 100)] -> [FeedForward layer]

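The output of the FF layer is a vector of the same size as the input vector, where a high value means that the item at that index is likely to be the next one clicked.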
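Reading the output vector as a ranking can be sketched as follows. This is only an illustration of the interpretation described above; `top_k_items` is a hypothetical helper and is not part of either implementation.

```python
def top_k_items(scores, k):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy output vector for 4 items; index 1 has the highest score.
scores = [0.1, 2.0, -1.0, 0.5]
print(top_k_items(scores, 2))
```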

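# Imports assumed for TF 0.11; the original post omits them
import tensorflow as tf
from tensorflow.python.ops import rnn, rnn_cell

num_hidden = 100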

x = tf.placeholder(tf.float32, [None, max_length, n_items], name="InputX")
y = tf.placeholder(tf.float32, [None, max_length, n_items], name="TargetY")
session_length = tf.placeholder(tf.int32, [None], name="SeqLenOfInput")

output, state = rnn.dynamic_rnn(
    rnn_cell.GRUCell(num_hidden),
    x,
    dtype=tf.float32,
    sequence_length=session_length
    )

layer = {'weights':tf.Variable(tf.random_normal([num_hidden, n_items])),
         'biases':tf.Variable(tf.random_normal([n_items]))}

output = tf.reshape(output, [-1, num_hidden])
prediction = tf.matmul(output, layer['weights'])

y_flat = tf.reshape(y, [-1, n_items])

final_output = tf.nn.softmax_cross_entropy_with_logits(prediction,y_flat)

cost = tf.reduce_sum(final_output)
optimizer = tf.train.AdamOptimizer().minimize(cost)
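For reference, the loss used above can be worked through by hand for a single time step. The sketch below is plain Python, not TensorFlow's implementation, and the function names are hypothetical. It shows that with a 1-HOT target only the target index contributes to the cross-entropy, so the dense target vector carries no more information than the integer index.

```python
import math


def softmax_cross_entropy(logits, one_hot_target):
    """Cross-entropy between softmax(logits) and a dense 1-HOT target."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(t * (v - log_sum) for v, t in zip(logits, one_hot_target))


def index_cross_entropy(logits, target_index):
    """Same loss, computed from the target index alone."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_index]


logits = [0.5, 2.0, -1.0]
dense_loss = softmax_cross_entropy(logits, [0, 1, 0])
index_loss = index_cross_entropy(logits, 1)
print(dense_loss, index_loss)
```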

Both implementations were tested on the same hardware, and both use the GPU.

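Edit: The Theano model has the same structure (1-HOT input -> GRU layer with 100 units -> FeedForward). I tested the Theano version with the same parameters I use in my model (cross-entropy loss, batch size = 200, Adam optimizer, the same learning rate, and no dropout in either model), but the speed difference is still the same.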

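Edit (2016-12-07): Queuing batches with file queues instead of using feed_dict helped a lot. I still need to make other optimizations to make it faster. In any case, here is how I used file queues to speed things up.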

# Create filename_queue
filename_queue = tf.train.string_input_producer(train_files, shuffle=True)

min_after_dequeue = 1024
capacity          = min_after_dequeue + 3*batch_size
examples_queue = tf.RandomShuffleQueue(
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        dtypes=[tf.string])

# Create multiple readers to populate the queue of examples
enqueue_ops = []
for i in range(n_readers):
    reader = tf.TextLineReader()
    _key, value = reader.read(filename_queue)
    enqueue_ops.append(examples_queue.enqueue([value]))

tf.train.queue_runner.add_queue_runner(
        tf.train.queue_runner.QueueRunner(examples_queue, enqueue_ops))
example_string = examples_queue.dequeue()

# Default values, and type of the columns, first is sequence_length
# +1 since first field is sequence length
record_defaults = [[0]]*(max_sequence_length+1)

enqueue_examples = []
for thread_id in range(n_preprocess_threads):
    # Decode the line dequeued from the shuffle queue, not the last reader's raw output
    example = tf.decode_csv(example_string, record_defaults=record_defaults)

    # Split the row into input/target values
    sequence_length = example[0]
    features = example[1:-1]
    targets  = example[2:]

    enqueue_examples.append([sequence_length, features, targets])

# Batch together examples
session_length, x_unparsed, y_unparsed = tf.train.batch_join(
        enqueue_examples, 
        batch_size=batch_size,
        capacity=2*n_preprocess_threads*batch_size)


# Parse the examples in a batch
x = tf.one_hot(x_unparsed, depth=n_classes)
y = tf.one_hot(y_unparsed, depth=n_classes)

# From here on, x, y and session_length can be used in the model
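One plausible reason the queue version helps: with feed_dict, the placeholders x and y each receive a dense float32 array of shape [batch, max_length, n_items], whereas the queue version moves integer IDs and expands them with tf.one_hot inside the graph. The back-of-the-envelope sketch below compares the two sizes per batch. The batch size and item count come from the post; MAX_LENGTH = 20 is a hypothetical value, since the post does not state it.

```python
def dense_batch_bytes(batch_size, max_length, n_items, bytes_per_value=4):
    """Bytes in one dense 1-HOT float32 tensor of shape [batch, length, items]."""
    return batch_size * max_length * n_items * bytes_per_value


def id_batch_bytes(batch_size, max_length, bytes_per_value=4):
    """Bytes in one int32 tensor of item IDs of shape [batch, length]."""
    return batch_size * max_length * bytes_per_value


BATCH_SIZE = 200   # from the post
N_ITEMS = 37803    # from the post
MAX_LENGTH = 20    # hypothetical; not given in the post

dense = dense_batch_bytes(BATCH_SIZE, MAX_LENGTH, N_ITEMS)
ids = id_batch_bytes(BATCH_SIZE, MAX_LENGTH)
print(dense, ids, dense // ids)
```

Under these assumptions the dense tensor is n_items times larger than the ID tensor, and both x and y would be fed at that size on every step.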
