Question

I'm new to Keras, and I'm trying to implement a model for an image captioning project.

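I'm trying to reproduce the model from Image captioning pre-inject architecture (the picture is taken from the paper: Where to put the image in an image captioning generator), with one small difference: it generates a word at every time step instead of a single word at the end. The input to the LSTM at the first time step is the embedded CNN features. The LSTM should support variable input lengths, so I pad all sequences with zeros so that every sequence has maxlen time steps.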
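For reference, the zero-padding described above can be sketched in plain Python. This is only an illustration: the helper name pad_captions is hypothetical, and it assumes that word id 0 is reserved for padding and that the zeros are appended at the end of each caption (the post does not say whether padding goes before or after the words).

```python
def pad_captions(captions, maxlen, pad_id=0):
    """Right-pad (or truncate) each list of word ids to exactly maxlen entries.

    Assumptions (not stated in the post): pad_id 0 is reserved for padding,
    and the padding is appended after the real words.
    """
    padded = []
    for caption in captions:
        clipped = caption[:maxlen]
        padded.append(clipped + [pad_id] * (maxlen - len(clipped)))
    return padded


print(pad_captions([[4, 9, 2], [7, 1]], maxlen=5))
# [[4, 9, 2, 0, 0], [7, 1, 0, 0, 0]]
```

Because id 0 is used up by the padding, the real vocabulary has to start at 1, which is why the Embedding layer needs voc_size + 1 input slots once masking is turned on.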

The code for the model I have right now is the following:

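# imports assumed (standalone Keras 2.x); the original post omitted them
from keras.layers import (Input, BatchNormalization, Dense, RepeatVector,
                          Embedding, LSTM, TimeDistributed, concatenate)
from keras.models import Model
from keras.optimizers import Adam

def get_model(model_name, batch_size, maxlen, voc_size, embed_size,
        cnn_feats_size, dropout_rate):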

    # create input layer for the cnn features
    cnn_feats_input = Input(shape=(cnn_feats_size,))

    # normalize CNN features 
    normalized_cnn_feats = BatchNormalization(axis=-1)(cnn_feats_input)

    # embed CNN features to have same dimension with word embeddings
    embedded_cnn_feats = Dense(embed_size)(normalized_cnn_feats)

    # add time dimension so that this layer output shape is (None, 1, embed_size)
    final_cnn_feats = RepeatVector(1)(embedded_cnn_feats)

    # create input layer for the captions (each caption has max maxlen words)
    caption_input = Input(shape=(maxlen,))

    # embed the captions
    embedded_caption = Embedding(input_dim=voc_size,
                                 output_dim=embed_size,
                                 input_length=maxlen)(caption_input)

    # concatenate CNN features and the captions.
    # Output shape should be (None, maxlen + 1, embed_size)
    img_caption_concat = concatenate([final_cnn_feats, embedded_caption], axis=1)

    # now feed the concatenation into a LSTM layer (many-to-many)
    lstm_layer = LSTM(units=embed_size,
                      input_shape=(maxlen + 1, embed_size),   # one additional time step for the image features
                      return_sequences=True,
                      dropout=dropout_rate)(img_caption_concat)

    # create a fully connected layer to make the predictions
    pred_layer = TimeDistributed(Dense(units=voc_size))(lstm_layer)

    # build the model with CNN features and captions as input and 
    # predictions output
    model = Model(inputs=[cnn_feats_input, caption_input], 
                  outputs=pred_layer)

    optimizer = Adam(lr=0.0001, 
                     beta_1=0.9, 
                     beta_2=0.999, 
                     epsilon=1e-8)

    model.compile(loss='categorical_crossentropy',optimizer=optimizer)
    model.summary()

    return model

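The model (as shown above) compiles without any errors (see: model summary), and I managed to train it on my data. However, it doesn't take into account that my sequences are zero-padded, so the results won't be accurate. When I try to change the Embedding layer to support masking (also making sure that I use voc_size + 1 instead of voc_size, as the documentation says), like this: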

embedded_caption = Embedding(input_dim=voc_size + 1,
                             output_dim=embed_size,
                             input_length=maxlen, mask_zero=True)(caption_input)

I get the following error:

Traceback (most recent call last):
  File "/export/home/.../py3_env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1567, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 200 and 1. Shapes are [200] and [1]. for 'concatenate_1/concat_1' (op: 'ConcatV2') with input shapes: [?,1,200], [?,25,1], [] and with computed input tensors: input[2] = <1>

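I don't know why it says the shape of the second array is [?, 25, 1], because when I print its shape right before the concatenation it is [?, 25, 200] (as it should be). I don't understand why a model that compiles and works fine without that argument would have a problem with it, but I assume I'm missing something.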

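I have also been thinking about using a Masking layer instead of mask_zero=True, but it would have to come before the Embedding, and the documentation says that the Embedding layer should be the first layer in a model (after the input).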

Is there anything I could change to fix this, or is there a workaround?

Answer 1

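The non-equal shape error refers to the masks rather than to the tensors/inputs. Because concatenate supports masking, it needs to handle mask propagation. Your final_cnn_feats doesn't have a mask (None), while your embedded_caption has a mask of shape (?, 25). You can check this with: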

print(embedded_caption._keras_history[0].compute_mask(caption_input))

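Since final_cnn_feats has no mask, concatenate will give it an all non-zero mask for proper mask propagation. While this is correct, that mask has the same shape as final_cnn_feats, which is (?, 1, 200) rather than (?, 1); that is, it covers every feature at every time step instead of just every time step. This is where the non-equal shape error comes from ((?, 1, 200) vs (?, 25)). The [?, 25, 1] in the traceback is the (?, 25) caption mask after concatenate has expanded it with a trailing dimension, which is why it does not match the [?, 25, 200] shape you printed for the tensor itself.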
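The mask handling described above can be imitated with a shape-only sketch. It is based on my reading of how Concatenate.compute_mask behaves in Keras 2.x and is not the library's code: the function name concat_mask_shapes is hypothetical, and it works on shape tuples (with None for the batch dimension) rather than on real tensors.

```python
def concat_mask_shapes(input_shapes, mask_shapes, axis=1):
    """Shape-only imitation of how Keras 2.x Concatenate combines masks.

    A mask shape of None means "this input carries no mask".
    Returns the shape of the combined mask, None if no input is masked,
    or raises ValueError when the prepared masks cannot be concatenated.
    """
    if all(m is None for m in mask_shapes):
        return None  # nothing is masked, so no mask is propagated

    prepared = []
    for in_shape, m_shape in zip(input_shapes, mask_shapes):
        if m_shape is None:
            # unmasked input: all-True mask with the shape of the input itself
            prepared.append(tuple(in_shape))
        elif len(m_shape) < len(in_shape):
            # lower-rank mask: expanded with a trailing dimension of size 1
            prepared.append(tuple(m_shape) + (1,))
        else:
            prepared.append(tuple(m_shape))

    first = prepared[0]
    total = 0
    for shape in prepared:
        if len(shape) != len(first):
            raise ValueError("rank mismatch: %s vs %s" % (first, shape))
        for d in range(len(shape)):
            if d != axis and shape[d] != first[d]:
                raise ValueError("cannot concatenate masks %s and %s"
                                 % (first, shape))
        total += shape[axis]

    combined = first[:axis] + (total,) + first[axis + 1:]
    # the concatenated mask is then reduced over its last axis (logical "all")
    return combined[:-1]


inputs = [(None, 1, 200), (None, 25, 200)]

try:
    # final_cnn_feats has no mask, embedded_caption has a (?, 25) mask
    concat_mask_shapes(inputs, [None, (None, 25)])
except ValueError as err:
    print("mask mismatch:", err)

# once final_cnn_feats carries a (?, 1) mask, the masks line up
print(concat_mask_shapes(inputs, [(None, 1), (None, 25)]))  # (None, 26)
```

The first call fails on (None, 1, 200) against (None, 25, 1), the same pair of shapes reported in the traceback. With no mask on either input the function returns None, which is consistent with the model compiling fine before mask_zero=True was added.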

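To fix this, you need to give final_cnn_feats a correct/matching mask. I'm not familiar with your project, but one option is to apply a Masking layer to final_cnn_feats, since it is designed to mask timestep(s).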

final_cnn_feats = Masking()(RepeatVector(1)(embedded_cnn_feats))

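This is only correct as long as the 200 features in final_cnn_feats are not all zero, i.e. there is always at least one non-zero value in final_cnn_feats. In that case, the Masking layer produces a (?, 1) mask and does not mask the single time step in final_cnn_feats.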
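To make that caveat concrete, here is a small plain-Python sketch of the rule the Masking layer applies with its default mask_value of 0: a time step is masked only when every feature in it equals mask_value. The helper name masking_mask and the toy 3-feature vectors are made up for illustration; this is not Keras code.

```python
def masking_mask(batch, mask_value=0.0):
    """Return one boolean per time step: True means "keep this step".

    batch is a nested list shaped (samples, timesteps, features).
    A step is masked (False) only when every feature equals mask_value.
    """
    return [[any(v != mask_value for v in step) for step in sample]
            for sample in batch]


normal = [[[0.3, 0.0, -1.2]]]   # one sample, one time step, some non-zero features
all_zero = [[[0.0, 0.0, 0.0]]]  # embedded image features that are all zero

print(masking_mask(normal))    # [[True]]
print(masking_mask(all_zero))  # [[False]]
```

In the second case the image time step would be masked out entirely, which is the situation the answer warns about.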

Keras image captioning model not compiling because of concatenate layer when mask_zero=True in a previous layer
