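I am new to machine learning, and as a learning exercise I am trying to implement a convolutional recurrent neural network (CRNN) in CNTK to recognize variable-length text in images. The basic idea is to take the output of a CNN, turn it into a sequence, feed that sequence to an RNN, and use CTC as the loss function. I followed the 'CNTK 208: Training Acoustic Model with Connectionist Temporal Classification (CTC) Criteria' tutorial, which shows the basics of using CTC. Unfortunately, during training my network ends up outputting only blank labels and nothing else, because for some reason that yields the smallest loss.
I feed my network images of shape (1, 32, 96), which I generate on the fly to show some random letters. As labels I give it sequences of one-hot encoded letters, with the blank at index 0 as CTC requires (these are all numpy arrays, because I use custom data loading). I found that for the forward_backward() function to work, both of its inputs must use the same dynamic axis with the same length. I achieve this by making my label strings the same length as the network output and by using to_sequence_like() in the code below. (I don't know how to do this better; a side effect of using to_sequence_like() here is that I need to pass dummy label data when evaluating the model.)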
import cntk as C

alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
input_dim_model = (1, 32, 96) # images are 96 x 32 with 1 channel of color (gray)
num_output_classes = len(alphabet) + 1 # +1 for the CTC blank at index 0
ltsm_hidden = 256
def bidirectionalLTSM(features, nHidden, nOut):
    a = C.layers.Recurrence(C.layers.LSTM(nHidden))(features)
    b = C.layers.Recurrence(C.layers.LSTM(nHidden), go_backwards=True)(features)
    c = C.splice(a, b)
    r = C.layers.Dense(nOut)(c)
    return r
def create_model_rnn(features):
    h = features
    h = bidirectionalLTSM(h, ltsm_hidden, ltsm_hidden)
    h = bidirectionalLTSM(h, ltsm_hidden, num_output_classes)
    return h
def create_model_cnn(features):
    with C.layers.default_options(init=C.glorot_uniform(), activation=C.relu):
        h = features
        h = C.layers.Convolution2D(filter_shape=(3,3),
                                   num_filters=64,
                                   strides=(1,1),
                                   pad=True, name='conv_0')(h)
        #more layers...
        h = C.layers.BatchNormalization(name="batchnorm_6")(h)
        return h
x = C.input_variable(input_dim_model, name="x")
label = C.sequence.input((num_output_classes), name="y")
def create_model(features):
    #Composite(x: Tensor[1,32,96]) -> Tensor[512,1,23]
    a = create_model_cnn(features)
    a = C.reshape(a, (512, 23))
    #Composite(x: Tensor[1,32,96]) -> Tensor[23,512]
    a = C.swapaxes(a, 0, 1)
    #is there a better way to convert to sequence and still be compatible with forward_backwards() ?
    #Composite(x: Tensor[1,32,96], y: Sequence[Tensor[37]]) -> Sequence[Tensor[512]]
    a = C.to_sequence_like(a, label)
    #Composite(x: Tensor[1,32,96], y: Sequence[Tensor[37]]) -> Sequence[Tensor[37]]
    a = create_model_rnn(a)
    return a
#Composite(x: Tensor[1,32,96], y: Sequence[Tensor[37]]) -> Sequence[Tensor[37]]
z = create_model(x)
#LabelsToGraph(y: Sequence[Tensor[37]]) -> Sequence[Tensor[37]]
graph = C.labels_to_graph(label)
#Composite(y: Sequence[Tensor[37]], x: Tensor[1,32,96]) -> np.float32
criteria = C.forward_backward(C.labels_to_graph(label), z, blankTokenId=0)
err = C.edit_distance_error(z, label, squashInputs=True, tokensToIgnore=[0])
lr = C.learning_rate_schedule(0.01, C.UnitType.sample)
learner = C.adadelta(z.parameters, lr)
progress_printer = C.logging.progress_print.ProgressPrinter(50, first=10, tag='Training')
trainer = C.Trainer(z, (criteria, err), learner, progress_writers=[progress_printer])
#some more custom code ...
#below is how I'm feeding the data
while True:
    x1, y1 = custom_datareader.next_minibatch()
    #x1 is a list of numpy arrays containing training images
    #y1 is a list of numpy arrays with one hot encoded labels
    trainer.train_minibatch({x: x1, label: y1})
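The question does not show custom_datareader, so the following is only a minimal sketch of the label encoding it describes: one-hot vectors over the alphabet, with index 0 reserved for the CTC blank. The helper name one_hot_label is hypothetical, and padding the label to the network output length is left out because the question does not say how it is done.

```python
import numpy as np

alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
num_output_classes = len(alphabet) + 1  # index 0 is reserved for the CTC blank


def one_hot_label(text):
    """Hypothetical helper: encode text as a (len(text), 37) float32 array."""
    # Shift every character index by one so that index 0 stays free for blank.
    indices = np.array([alphabet.index(ch) + 1 for ch in text], dtype=np.intp)
    encoded = np.zeros((len(text), num_output_classes), dtype=np.float32)
    encoded[np.arange(len(text)), indices] = 1.0
    return encoded


y_example = one_hot_label("a0")
```

Here y_example has one row per character, and no row ever sets column 0, since that column is reserved for the blank.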
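The network converges quickly, though not to where I want it to be (the network output is on the left, the label I gave it is on the right):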
Minibatch[ 11- 50]: loss = 3.506087 * 58880, metric = 176.23% * 58880;
lllll--55leym---------- => lllll--55leym----------, gt: aaaaaaaaaaaaaaaaaaaayox
-------bbccaqqqyyyryy-q => -------bbccaqqqyyyryy-q, gt: AAAAAAAAAAAAAAAAAAAJPTA
tt22yye------yqqqtll--- => tt22yye------yqqqtll---, gt: tttttttttttttttttttyliy
ceeeeeeee----eqqqqqqe-q => ceeeeeeee----eqqqqqqe-q, gt: sssssssssssssssssssskht
--tc22222al55a5qqqaa--q => --tc22222al55a5qqqaa--q, gt: cccccccccccccccccccaooa
yyyyyyiqaaacy---------- => yyyyyyiqaaacy----------, gt: cccccccccccccccccccxyty
mcccyya----------y---qq => mcccyya----------y---qq, gt: ppppppppppppppppppptjnj
ylncyyyy--------yy--t-y => ylncyyyy--------yy--t-y, gt: sssssssssssssssssssyusl
tt555555ccc------------ => tt555555ccc------------, gt: jjjjjjjjjjjjjjjjjjjmyss
-------eeeaadaaa------5 => -------eeeaadaaa------5, gt: fffffffffffffffffffciya
eennnnemmtmmy--------qy => eennnnemmtmmy--------qy, gt: tttttttttttttttttttajdn
-rcqqqqaaaacccccycc8--q => -rcqqqqaaaacccccycc8--q, gt: aaaaaaaaaaaaaaaaaaaixvw
------33e-bfaaaaa------ => ------33e-bfaaaaa------, gt: uuuuuuuuuuuuuuuuuuupfyq
r----5t5y5aaaaa-------- => r----5t5y5aaaaa--------, gt: fffffffffffffffffffapap
deeeccccc2qqqm888zl---t => deeeccccc2qqqm888zl---t, gt: hhhhhhhhhhhhhhhhhhhlvjx
Minibatch[ 51- 100]: loss = 1.616731 * 73600, metric = 100.82% * 73600;
----------------------- => -----------------------, gt: kkkkkkkkkkkkkkkkkkkakyw
----------------------- => -----------------------, gt: ooooooooooooooooooopwtm
----------------------- => -----------------------, gt: jjjjjjjjjjjjjjjjjjjqpny
----------------------- => -----------------------, gt: iiiiiiiiiiiiiiiiiiidspr
----------------------- => -----------------------, gt: fffffffffffffffffffatyp
----------------------- => -----------------------, gt: vvvvvvvvvvvvvvvvvvvmccf
----------------------- => -----------------------, gt: dddddddddddddddddddsfyo
----------------------- => -----------------------, gt: yyyyyyyyyyyyyyyyyyylaph
----------------------- => -----------------------, gt: kkkkkkkkkkkkkkkkkkkacay
----------------------- => -----------------------, gt: uuuuuuuuuuuuuuuuuuujuqs
----------------------- => -----------------------, gt: sssssssssssssssssssovjp
----------------------- => -----------------------, gt: vvvvvvvvvvvvvvvvvvvibma
----------------------- => -----------------------, gt: vvvvvvvvvvvvvvvvvvvaajt
----------------------- => -----------------------, gt: tttttttttttttttttttdhfo
----------------------- => -----------------------, gt: yyyyyyyyyyyyyyyyyyycmbh
Minibatch[ 101- 150]: loss = 0.026177 * 73600, metric = 100.00% * 73600;
----------------------- => -----------------------, gt: iiiiiiiiiiiiiiiiiiiavoo
----------------------- => -----------------------, gt: lllllllllllllllllllaara
----------------------- => -----------------------, gt: pppppppppppppppppppmufu
----------------------- => -----------------------, gt: sssssssssssssssssssaacd
----------------------- => -----------------------, gt: uuuuuuuuuuuuuuuuuuujulx
----------------------- => -----------------------, gt: vvvvvvvvvvvvvvvvvvvoaqy
----------------------- => -----------------------, gt: dddddddddddddddddddvjmr
----------------------- => -----------------------, gt: oooooooooooooooooooxlvl
----------------------- => -----------------------, gt: dddddddddddddddddddqqlo
----------------------- => -----------------------, gt: wwwwwwwwwwwwwwwwwwwwrvx
----------------------- => -----------------------, gt: pppppppppppppppppppxuxi
----------------------- => -----------------------, gt: bbbbbbbbbbbbbbbbbbbkbqv
----------------------- => -----------------------, gt: ppppppppppppppppppplpha
----------------------- => -----------------------, gt: dddddddddddddddddddilol
----------------------- => -----------------------, gt: dddddddddddddddddddqnwf
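For reading the log above, it may help to recall the usual CTC decoding rule: collapse repeated symbols, then drop blanks. The sketch below assumes that '-' in the log stands for the blank label; ctc_collapse is a hypothetical helper and not part of CNTK.

```python
BLANK = "-"  # assumption: the log prints the CTC blank (index 0) as '-'


def ctc_collapse(raw):
    """Collapse repeated symbols, then remove blanks (greedy CTC decoding)."""
    decoded = []
    previous = None
    for symbol in raw:
        # Keep a symbol only when it differs from its predecessor and is not blank.
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


first_row = ctc_collapse("lllll--55leym----------")
all_blank = ctc_collapse("-" * 23)
```

Under this rule an all-blank output, like the rows from minibatch 51 onward, decodes to an empty string.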
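My question is how to get the network to learn to output the correct text. I would add that I successfully trained a model using the same technique in PyTorch, so the images or labels are unlikely to be the problem. Also, is there a better way to convert the output of the convolutional layers into a sequence with a dynamic axis, so that I can still use it with the forward_backward() function?
Answer 0 (score: 1)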
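By default, CNTK learners use aggregated gradients in order to accommodate distributed training with varying minibatch sizes. However, aggregated gradients do not work well with adagrad-style learners such as adadelta. Please try use_mean_gradient=True: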
learner = C.adadelta(z.parameters, lr, use_mean_gradient=True)
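As a rough illustration of the distinction this answer draws, the sketch below compares a summed (aggregated) gradient with a mean gradient for a toy one-parameter least-squares model. It is plain Python and does not use CNTK. The function names are hypothetical, and the example only shows that the summed gradient scales with the minibatch size while the mean does not.

```python
def per_sample_gradients(w, xs, ys):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w, for each sample."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]


def aggregated_gradient(grads):
    # Sum over the minibatch: grows with the number of samples.
    return sum(grads)


def mean_gradient(grads):
    # Average over the minibatch: independent of the number of samples.
    return sum(grads) / len(grads)


xs_small, ys_small = [1.0, 2.0], [2.0, 4.0]
xs_large, ys_large = xs_small * 8, ys_small * 8  # same data, 8x larger minibatch

g_small = per_sample_gradients(0.0, xs_small, ys_small)
g_large = per_sample_gradients(0.0, xs_large, ys_large)
```

With these toy numbers the aggregated gradient of the larger minibatch is eight times that of the smaller one, while the mean gradient is identical for both.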
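Answer 1 (score: 0)
A number of things make it difficult to train a CRNN model with CNTK (getting the label format right is tricky, there is the whole LabelsToGraph transformation, there is no transcription error metric, and so on). Here is an implementation of the model that works correctly: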
https://github.com/BenjaminTrapani/SceneTextOCR/tree/master
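It relies on a fork of CNTK that fixes an image reader bug, provides a transcription error function, and improves the performance of the text format reader. It also provides an application that generates text-format labels from the mjsynth dataset. For reference, here is how the labels are formatted: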
513528 |textLabel 7:2
513528 |textLabel 26:1
513528 |textLabel 0:2
513528 |textLabel 26:1
513528 |textLabel 20:2
513528 |textLabel 26:1
513528 |textLabel 11:2
513528 |textLabel 26:1
513528 |textLabel 8:2
513528 |textLabel 26:1
513528 |textLabel 4:2
513528 |textLabel 26:1
513528 |textLabel 17:2
513528 |textLabel 26:1
513528 |textLabel 18:2
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
513528 |textLabel 26:1
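513528 is the sequence ID and should match the sequence ID of the corresponding image data for the same sample. textLabel is used to create a stream for the minibatch source. You can create the stream in C++ as follows: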
StreamConfiguration textLabelConfig(L"textLabel", numClasses, true, L"textLabel");
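26 is the index of the blank character used for CTC decoding. The other values before the ":" are the character codes of the label. The 1 one-hot encodes each vector in the sequence. The many trailing blank characters ensure that the sequence is as long as the maximum supported sequence length, because at the time of writing the CTC loss function implementation does not support variable-length sequences.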
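The layout of the sample above can be reproduced with a short script. This is only a sketch that mirrors the visible pattern (each character code with value 2, followed by a blank with value 1, then blanks up to 32 lines). It assumes that character codes map 'a' to 0 through 'z' to 25, under which the sample appears to spell "hauliers". format_label is a hypothetical helper and is not taken from the linked repository.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: 'a' -> 0 ... 'z' -> 25
BLANK_INDEX = 26                   # blank index stated in the answer
MAX_SEQUENCE_LENGTH = 32           # number of lines in the sample above


def format_label(sequence_id, text, stream_name="textLabel"):
    """Hypothetical helper: return the label lines for one sample."""
    entries = []
    for ch in text:
        entries.append((ALPHABET.index(ch), 2))  # character entry, as in the sample
        entries.append((BLANK_INDEX, 1))         # blank after every character
    if len(entries) > MAX_SEQUENCE_LENGTH:
        raise ValueError("text is too long for the fixed sequence length")
    # Pad with trailing blanks so every sequence has the same length.
    entries.extend([(BLANK_INDEX, 1)] * (MAX_SEQUENCE_LENGTH - len(entries)))
    return ["%d |%s %d:%d" % (sequence_id, stream_name, index, value)
            for index, value in entries]


lines = format_label(513528, "hauliers")
```

Joining lines with newlines gives the same 32 lines as the sample shown in the answer.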