I am building an RNN in Keras.

from keras.models import Sequential
from keras.layers import Masking, SimpleRNN, Dropout, Dense, TimeDistributed
from keras.optimizers import SGD

def RNN_keras(max_timestep_len, feat_num):
    model = Sequential()
    # Mask out padded timesteps (the padding value is -1.0)
    model.add(Masking(mask_value=-1.0, input_shape=(max_timestep_len, feat_num)))
    # return_sequences=True so that TimeDistributed gets one output per timestep
    model.add(SimpleRNN(128, activation='relu', return_sequences=True))
    model.add(Dropout(0.2))
    model.add(TimeDistributed(Dense(1, activation='relu')))

    sgd = SGD(lr=0.1, decay=1e-6)
    model.compile(loss='mean_squared_error',
                  optimizer=sgd,
                  metrics=['mean_squared_error'])
    return model

for epoch in range(1, NUM_EPOCH+1):
    batch_index = 0
    for X_batch, y_batch in mig.Xy_gen(mig.X_train, mig.y_train, batch_size=BATCH_SIZE):
        batch_index += 1

        X_train_pad = sequence.pad_sequences(X_batch, maxlen=mig.ttb.MAX_SEQ_LEN, padding='pre', value=-1.0)
        y_train_pad = sequence.pad_sequences(y_batch, maxlen=mig.ttb.MAX_SEQ_LEN, padding='pre', value=-1.0)
        loss = rnn.train_on_batch(X_train_pad, y_train_pad)
        print("Epoch", epoch, ": Batch", batch_index, "-",
              rnn.metrics_names[0], "=", loss[0], "-", rnn.metrics_names[1], "=", loss[1])

Output:

Epoch 1 : Batch 1 - loss = 715.478 - mean_squared_error = 178.191
Epoch 1 : Batch 2 - loss = 1.32964e+12 - mean_squared_error = 2.7457e+11
Epoch 1 : Batch 3 - loss = 2880.08 - mean_squared_error = 594.089
Epoch 1 : Batch 4 - loss = 4065.16 - mean_squared_error = 1031.27
Epoch 1 : Batch 5 - loss = 3489.96 - mean_squared_error = 695.302
Epoch 1 : Batch 6 - loss = 546.395 - mean_squared_error = 147.439
Epoch 1 : Batch 7 - loss = 1353.35 - mean_squared_error = 241.043
Epoch 1 : Batch 8 - loss = 1962.75 - mean_squared_error = 426.699
Epoch 1 : Batch 9 - loss = 2680.85 - mean_squared_error = 504.812

My questions:

  • Is it normal that the batch losses do not decrease?

  • I set both the loss and the metric to 'mean_squared_error'. Why are the reported loss and mean_squared_error values different? Are they computed on different parts of the training data?

  • How should I decide between 'pre' padding and 'post' padding? 'Pre' is like adding 'START', while 'post' is like adding 'END'. But as I understand it, both 'START' and 'END' matter in sequence labeling. Right?

  • In the TimeDistributed layer, is Y_t also determined by y_{t-1}, y_{t-2}, ...? Or is it just a sequence-wise version of a Dense layer, where the outputs at all timesteps are independent?
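To make the padding question concrete, here is a minimal pure-numpy sketch of what I understand `sequence.pad_sequences` to do (the function `pad` below is a hypothetical stand-in written for illustration, not the Keras implementation):

```python
import numpy as np

def pad(seqs, maxlen, padding='pre', value=-1.0):
    """Hypothetical sketch of pad_sequences for 1-D integer sequences."""
    out = np.full((len(seqs), maxlen), value)
    for i, s in enumerate(seqs):
        s = s[-maxlen:]  # truncate overlong sequences from the front
        if padding == 'pre':
            out[i, maxlen - len(s):] = s   # padding goes before the sequence
        else:
            out[i, :len(s)] = s            # padding goes after the sequence
    return out

pre = pad([[1, 2, 3], [4, 5]], maxlen=4, padding='pre')
# → [[-1, 1, 2, 3], [-1, -1, 4, 5]]
post = pad([[1, 2, 3], [4, 5]], maxlen=4, padding='post')
# → [[1, 2, 3, -1], [4, 5, -1, -1]]
```

So 'pre' shifts every real token to the end of the window and 'post' leaves them at the start, which is why the choice feels analogous to adding 'START' vs. 'END' tokens.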
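And to pin down what I mean by "independent outputs" in the last question, here is a pure-numpy sketch of that interpretation of `TimeDistributed(Dense)`: one shared kernel `W` and bias `b` (stand-ins for the Dense layer's weights) applied separately to each timestep, so y_t would depend only on x_t:

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, H = 5, 3, 1                 # timesteps, input features, dense units
x = rng.normal(size=(T, F))       # one sample, shape (timesteps, features)
W = rng.normal(size=(F, H))       # shared Dense kernel (assumed)
b = np.zeros(H)                   # shared Dense bias (assumed)

# Applying the weights to the whole sequence at once...
y_all = x @ W + b

# ...gives the same result as applying them to each timestep separately,
# i.e. y[t] depends only on x[t], never on x[t-1] or y[t-1].
y_per_step = np.stack([x[t] @ W + b for t in range(T)])
assert np.allclose(y_all, y_per_step)
```

If this interpretation is right, any dependence of Y_t on earlier timesteps would have to come from the recurrent layer below, not from the TimeDistributed layer itself.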