Segmentation-free handwritten text recognition using Keras

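I am currently developing an application for segmentation-free handwritten text recognition. Text lines are extracted from the input document, and each text line should then be recognized.

For development purposes I use the IAM Handwriting Database. It provides text line images together with the corresponding ASCII text.

For the recognition, I adapted the approaches from the papers "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition" and "Can We Build Language-independent OCR Using LSTM Networks?".

Basically, I use a bidirectional GRU architecture and the forward-backward algorithm to align the transcription with the output of the neural network.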

An image from the database looks like this:
[image: example text line from the IAM Handwriting Database]

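An image is represented as a 1D sequence of pixel values. More precisely, the image is first scaled to a height of 32 pixels.
The numpy array of the above image, which measures 597 x 32 pixels, has the shape (597, 32).
The numpy array holding all n training images has the shape (n, w, 32), where w is the variable width of a line image (e.g. 597).

The following code shows how the training images and transcriptions are represented: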

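import numpy
from sklearn.feature_extraction import image as sklearn_image

x_train = []
y_train = []
line_height_normalized = 32  # target height of every text line in pixels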
for i in range(sample_size):
    transcription_train, image_train = self._get_next_sample()
    image_train = convert_to_grayscale(image_train)
    image_train = scale_y(image_train, line_height_normalized)
    image_train_patches = sklearn_image.extract_patches_2d(image_train, (line_height_normalized, 1))   
    image_train_patches = numpy.reshape(image_train_patches, (image_train_patches.shape[0], -1))
    x_train.append(image_train_patches)
    y_train.append(transcription_train)
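
For a grayscale line image stored as a (height, width) array, taking one 32-pixel column per time step should amount to a transpose. The following is only a minimal NumPy sketch of that representation; the random array is a stand-in for a real IAM text line, and the variable names are illustrative.

```python
import numpy

line_height_normalized = 32
width = 597

# Stand-in for a scaled grayscale text line, stored as (height, width).
image = numpy.random.rand(line_height_normalized, width)

# One column of 32 pixel values per time step: shape (width, 32).
columns = image.T

# Add the batch axis so the array matches an input of shape (None, 32).
batch = columns[numpy.newaxis, ...]

print(columns.shape, batch.shape)
```

Each row of `columns` is one time step, so the network sees the line from left to right, one pixel column at a time.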

I use Keras, and I created the recurrent neural network and the CTC function based on this example.

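from keras.layers import Input, GRU, Dense, Activation, Lambda, add, concatenate
from keras.models import Model
from keras.optimizers import SGD

charset = 68
number_of_memory_units = 512
time_steps = None  # variable number of time steps (line width)
input_dimension = 32  # the height of a text line in pixels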

# input shape see https://github.com/keras-team/keras/issues/3683
network_input = Input(name="input", shape=(time_steps, input_dimension))  

gru_layer_1 = GRU(number_of_memory_units, return_sequences=True, kernel_initializer='he_normal',
                  name='gru_layer_1')(network_input)
gru_layer_1_backwards = GRU(number_of_memory_units, return_sequences=True, go_backwards=True,
                  kernel_initializer='he_normal',name='gru_layer_1_backwards')(network_input)
gru_layer_1_merged = add([gru_layer_1, gru_layer_1_backwards])
gru_layer_2 = GRU(number_of_memory_units, return_sequences=True, kernel_initializer='he_normal',
                  name='gru_layer_2')(gru_layer_1_merged)
gru_layer_2_backwards = GRU(number_of_memory_units, return_sequences=True, go_backwards=True, kernel_initializer='he_normal',
                  name='gru_layer_2_backwards')(gru_layer_1_merged)

output_layer = Dense(charset, kernel_initializer='he_normal',
                  name='dense_layer')(concatenate([gru_layer_2, gru_layer_2_backwards]))
prediction = Activation('softmax', name='output_to_ctc')(output_layer)

# create the ctc layer
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')
max_line_length = 200  # see QUESTION 1
labels = Input(name='labels', shape=[max_line_length], dtype='float32')
loss_out = Lambda(RecurrentNeuralNetwork._ctc_function, name='ctc')(
        [prediction, labels, input_length, label_length])
model = Model(inputs=[network_input, labels, input_length, label_length], outputs=loss_out)

sgd = SGD(lr=0.02, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=5)
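# dummy loss: the 'ctc' Lambda layer already computes the CTC loss, so pass the model output through
model.compile(loss={'ctc': lambda l_truth, l_prediction: l_prediction}, optimizer=sgd)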

Question 1
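The example uses max_line_length. From what I have read on the internet (though I don't think I understood it well), the maximum line length is needed because the underlying CTC function has to know how many tensors to create.
Which length is appropriate for variable line lengths, and how does it affect the recognition of unseen text lines?
Also, what exactly do the input_length and label_length variables represent?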

In the next step, the network is trained:

batch_size = 1  
number_of_epochs = 4 

size = 32  # line height? see QUESTION 2
input_length = numpy.zeros([size, 1])
label_length = numpy.zeros([size, 1])
for epoch in range(number_of_epochs):
    for x_train_batch, y_train_batch in zip(x_train, y_train_labels):
        x_train_batch = numpy.reshape(x_train_batch, (1, len(x_train_batch), 32))
        inputs = {'input': x_train_batch, 'labels': numpy.array(y_train_batch),
                      'input_length': input_length, 'label_length': label_length}
        outputs = {'ctc': numpy.zeros([size])}  # dummy data for dummy loss function
        self.model.fit(x=inputs, y=outputs, batch_size=batch_size, epochs=1, shuffle=False)
        self.model.reset_states()

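Because the time steps have a variable length (the width of the text line), training uses a batch size of 1.
The transcription of a text line is represented by the numpy array y_train_batch; each character is encoded as a number.
The transcription of the image example above looks like this: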

[26 62 38 40 47 30 62 19 14 62 18 19 14 15 62 38 17 64 62 32  0  8 19 18 10  4 11 11 62  5 17 14 12]

Question 2
What does the size variable represent? Is it the dimension of a single image patch, and therefore the number of features per time step?

Errors
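The following error occurs:

expected labels to have shape (200,) but got array with shape (1,)
Is it necessary to pad the label array so that it contains 200 elements?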

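When I replace the value of max_line_length with 1, the next error occurs:

All input arrays (x) should have the same number of samples. Got array shapes: [(1, 597, 32), (33, 1), (32, 1), (32, 1)]
Is it necessary to reshape the other three arrays?
What is the "right" way to resolve this and the errors that might come next?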

Maybe someone can point me in the right direction.
Thank you very much!

1 Answer

Okay, I can't explain this in the 600 characters the comment section allows, so I'll do it in an answer, though I'm ignoring your Q2.

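The code for the paper you mentioned can be found here: https://github.com/bgshih/crnn. It is a good starting point for handwritten text recognition. However, the CRNN implementation recognizes text at word level, whereas you want to do it at line level, so you need a larger input image; for example, I use 800x64px and a maximum text length of 100. As already said, stretching the image to the desired size does not work very well. In my experiments the accuracy increased when using padding instead (randomize the position a bit; this is an easy way to do data augmentation).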
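
The padding idea can be sketched roughly as follows. This is only an illustration, not the code I actually use: the function name `pad_to_canvas`, the white background value of 255 and the uniform random placement are all assumptions.

```python
import numpy

def pad_to_canvas(image, target_height=64, target_width=800, rng=None):
    """Place a (h, w) grayscale image at a random position on a white canvas."""
    if rng is None:
        rng = numpy.random.default_rng()
    h, w = image.shape
    if h > target_height or w > target_width:
        raise ValueError("image must be downscaled to fit the canvas first")
    # Assumption: 8-bit grayscale where 255 is the white background.
    canvas = numpy.full((target_height, target_width), 255, dtype=image.dtype)
    # Random offset so the text is not always in the same place (augmentation).
    top = int(rng.integers(0, target_height - h + 1))
    left = int(rng.integers(0, target_width - w + 1))
    canvas[top:top + h, left:left + w] = image
    return canvas

# Stand-in for a dark text line of 597 x 32 pixels.
line = numpy.zeros((32, 597), dtype=numpy.uint8)
padded = pad_to_canvas(line, rng=numpy.random.default_rng(0))
print(padded.shape)
```

Every sample ends up with the same shape, so larger batches become possible, and the aspect ratio of the handwriting is preserved.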

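There is a relation between the maximum text length L and the input image width W: the neural network (NN) downscales the input image by a fixed scaling factor f, so L = W / f (in my example: W = 800px, L = 100, f = 8). The attached illustration shows the input image (800x64px) and the character-probability matrix (the probability of each of the 80 possible characters at each of the 100 time steps). The NN maps the input image to this character-probability matrix, which serves as input for the CTC. Since the matrix contains L time steps, there can be at most L characters. This obviously holds for decoding, but the loss calculation must also align the ground-truth text with this matrix, and how should a text with L + 1 characters be aligned to only the L time steps the matrix contains!? Please note that in the CTC calculation repeated characters (as in "piZZa") must be separated by a special character, so each repetition reduces the possible text length by a further 1.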
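
The length constraint can be written down as a small check. This is a sketch under the assumptions stated above (fixed downscaling factor, one separator per pair of adjacent repeated characters); the function names are made up for illustration.

```python
def min_ctc_time_steps(text):
    """Smallest number of time steps CTC needs to align this text."""
    # Each pair of adjacent identical characters needs one separator in between.
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    return len(text) + repeats

def fits(text, image_width, downscale_factor):
    """True if the text can be aligned to the time steps the network emits."""
    time_steps = image_width // downscale_factor
    return min_ctc_time_steps(text) <= time_steps

print(min_ctc_time_steps("pizza"))
print(fits("pizza", 800, 8))
```

With W = 800 and f = 8 there are 100 time steps, so "pizza" (5 characters plus 1 separator) fits easily, while a text of 100 characters that contains a doubled letter would not.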

I think that with this explanation you should be able to figure out how all the length variables in your code relate to each other.
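
As a rough sketch of how those variables could be filled: I am assuming here that your `_ctc_function` wraps Keras's `ctc_batch_cost`, which expects the labels padded to a common length plus, for each sample, the number of time steps in the prediction (`input_length`) and the number of real characters in the label (`label_length`). The helper name `make_ctc_batch` and the padding value 0 are my own choices.

```python
import numpy

def make_ctc_batch(feature_sequences, label_sequences, max_line_length):
    """Build padded labels and per-sample length arrays for a CTC loss layer."""
    n = len(feature_sequences)
    labels = numpy.zeros((n, max_line_length), dtype='float32')
    input_length = numpy.zeros((n, 1), dtype='int64')
    label_length = numpy.zeros((n, 1), dtype='int64')
    for i, (features, label) in enumerate(zip(feature_sequences, label_sequences)):
        if len(label) > max_line_length:
            raise ValueError("transcription is longer than max_line_length")
        # Padding entries beyond label_length are expected to be ignored by CTC.
        labels[i, :len(label)] = label
        # Without downscaling, the network emits one time step per pixel column.
        input_length[i, 0] = len(features)
        label_length[i, 0] = len(label)
    return labels, input_length, label_length

features = [numpy.zeros((597, 32))]   # one text line, 597 time steps
transcription = [[26, 62, 38, 40]]    # first characters of the example label
labels, input_length, label_length = make_ctc_batch(features, transcription, 200)
print(labels.shape, input_length.shape, label_length.shape)
```

All arrays then share the same first dimension (the number of samples), which is what the second error message complains about. If the network downscaled its input, `input_length` would have to be the width divided by the scaling factor instead.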

Answered on 2024-04-29T22:46:01+08:00
