Question

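I am trying to port a working CNN-LSTM image captioning network from TensorFlow to CNTK. I believe I have a correctly trained model, but I am struggling to figure out how to extract predictions from the final trained CNTK model.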

This is the general architecture I am working with. This is my CNTK model:

def create_lstm_model(image_features, text_features):
    embedding_dim = 512
    hidden_dim = 512
    cell_dim = 512
    vocab_dim = 77

    image_embedding = Embedding(embedding_dim)
    text_embedding = Embedding(embedding_dim)
    lstm_classifier = Sequential([Stabilizer(),
                                  Recurrence(LSTM(hidden_dim)),
                                  Recurrence(LSTM(hidden_dim)),
                                  Stabilizer(),
                                  Dense(vocab_dim)]) 

    embedded_images = BatchNormalization()(image_embedding(image_features))
    embedded_text = text_embedding(text_features)
    lstm_input = C.plus(embedded_images, embedded_text)
    lstm_input = C.dropout(lstm_input, 0.5)
    output = lstm_classifier(lstm_input)    

    return output

I feed my data in CTF format using the following structure, with a fixed caption sequence size of 40:

def create_reader(path, is_training):
    return MinibatchSource(CTFDeserializer(path, StreamDefs(
        target_tokens = StreamDef(field='target_tokens', shape=vocab_len, is_sparse=True),
        input_tokens = StreamDef(field='input_tokens', shape=vocab_len, is_sparse=True),
        image_features = StreamDef(field='image_features', shape=image_features_dim, is_sparse=False)
    )), randomize = is_training, max_sweeps = INFINITELY_REPEAT if is_training else 1)

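As an aside, the reason for the three data streams: I have one input image feature vector (the last 2048-dim layer of a pretrained ResNet), a sequence of input text tokens, and a sequence of output text tokens. So basically my CTF file looks like this, sequence-wise: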

0 | target_token_0  | input_token_0 | input_image_feature_vector (2048-dim)
0 | target_token_1  | input_token_1 | empty array of 2048 zeros
0 | target_token_2  | input_token_2 | empty array of 2048 zeros
...
0 | target_token_40 | input_token_40 | empty array of 2048 zeros
1 | target_token_0  | input_token_0 | input_image_feature_vector (2048-dim)
1 | target_token_1  | input_token_1 | empty array of 2048 zeros
1 | target_token_2  | input_token_2 | empty array of 2048 zeros
...
1 | target_token_40 | input_token_40 | empty array of 2048 zeros
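
For reference, the layout above could be written out with a small helper like the one below. This is only a sketch: the stream names are taken from the create_reader() call above, while the function name ctf_lines, the token ids, and the 4-dim feature vector (standing in for the 2048-dim one) are made up for illustration. It assumes the usual CTF conventions of a leading sequence id, |name stream markers, and index:value pairs for sparse entries.

```python
def ctf_lines(seq_id, target_ids, input_ids, image_vec):
    """Yield CTF lines for one caption sequence (hypothetical helper).

    One-hot tokens are written sparsely as `index:1`. The dense image
    vector is written in full on the first line and as zeros afterwards,
    matching the layout described in the question.
    """
    zeros = [0.0] * len(image_vec)
    for step, (tgt, inp) in enumerate(zip(target_ids, input_ids)):
        feats = image_vec if step == 0 else zeros
        dense = " ".join(format(v, "g") for v in feats)
        yield "{} |target_tokens {}:1 |input_tokens {}:1 |image_features {}".format(
            seq_id, tgt, inp, dense)


# Illustrative values only: 3 steps instead of 40, 4 features instead of 2048.
lines = list(ctf_lines(0, target_ids=[5, 9, 2], input_ids=[1, 5, 9],
                       image_vec=[0.25, 0.5, 0.0, 1.0]))
for line in lines:
    print(line)
```

Every line of a sequence repeats the same sequence id, which is how the reader groups the rows into one sequence.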

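Basically, I couldn't figure out how to slice & splice two sequences together in CNTK (even though you can splice two tensors easily), so I hacked around it by feeding the first element of each sequence with the input 2048-dim image feature vector, and the rest of the elements in the sequence with an empty 2048-dim vector of zeros, combined with the: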

C.plus(embedded_images, embedded_text)

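in the model above. The goal is basically to take the first element of the 40-element [2048]->[512] image embedding sequence and hack-splice (TM) it in front of the last 39 elements of the 40-element [vocab_dim]->[512] word embedding sequence. I'm counting on learning a pretty much empty [2048]->[512] image embedding for the empty image vectors (2048 zeros), so I add my embedded image sequence element-wise to my embedded text sequence before it goes into the LSTM. Basically, this: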

image embedding sequence: [-1, 40, 512]  (e.g., [-1, 0, 512])
text embedding sequence:  [-1, 40, 512]  (e.g., [-1, 1:40, 512])
+
---------------------------------------
lstm input sequence:      [-1, 40, 512]
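
The element-wise addition can be sketched without any framework. The snippet below is a toy illustration, not the CNTK model: embed and add_sequences are hypothetical helpers, the 3->2 weight matrix stands in for the [2048]->[512] embedding, and the BatchNormalization step from the model is deliberately left out. With a bias-free linear embedding an all-zero vector maps to all zeros, so only the first step is changed by the image. Once batch normalization is applied, the zero rows need not stay exactly zero.

```python
def embed(vec, weights):
    """Bias-free linear embedding: vec (length n) times weights (n x d)."""
    return [sum(v * row[j] for v, row in zip(vec, weights))
            for j in range(len(weights[0]))]


def add_sequences(seq_a, seq_b):
    """Element-wise sum of two equally long sequences of vectors."""
    return [[a + b for a, b in zip(va, vb)] for va, vb in zip(seq_a, seq_b)]


# Hypothetical toy sizes: image dim 3, embedding dim 2, sequence length 3.
image_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
image_seq = [[1.0, 0.0, 1.0],   # real image features at step 0
             [0.0, 0.0, 0.0],   # zero padding afterwards
             [0.0, 0.0, 0.0]]
text_emb_seq = [[0.5, 0.5], [0.1, 0.2], [0.3, 0.4]]  # already embedded tokens

image_emb_seq = [embed(v, image_weights) for v in image_seq]
lstm_input = add_sequences(image_emb_seq, text_emb_seq)
print(lstm_input)
```

Note that step 0 ends up as the sum of the image embedding and the embedding of the first input token, rather than the image embedding alone.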

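This brings me to my actual question. Now that I have a model that trains reasonably well, I would like to extract caption predictions from it, basically doing something like this (from this PyTorch image captioning tutorial):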

def sample(self, features, states=None):
    """Samples captions for given image features (Greedy search)."""
    sampled_ids = []
    inputs = features.unsqueeze(1)
    for i in range(20):                                      # maximum sampling length
        hiddens, states = self.lstm(inputs, states)          # (batch_size, 1, hidden_size), 
        outputs = self.linear(hiddens.squeeze(1))            # (batch_size, vocab_size)
        predicted = outputs.max(1)[1]
        sampled_ids.append(predicted)
        inputs = self.embed(predicted)
        inputs = inputs.unsqueeze(1)                         # (batch_size, 1, embed_size)
    sampled_ids = torch.cat(sampled_ids, 1)                  # (batch_size, 20)
    return sampled_ids.squeeze()

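The problem is that I can't figure out the CNTK equivalent of getting the hidden state out of the LSTM and feeding it back in at the next step: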

hiddens, states = self.lstm(inputs, states)

How does this work in CNTK?

Answer 1

I think the function you are looking for is RecurrenceFrom(). Its documentation contains the following example:

Example:
 >>> from cntk.layers import *
 >>> from cntk.layers.typing import *

 >>> # a plain sequence-to-sequence model in training (where label length is known)
 >>> en = C.input_variable(**SequenceOver[Axis('m')][SparseTensor[20000]])  # English input sentence
 >>> fr = C.input_variable(**SequenceOver[Axis('n')][SparseTensor[30000]])  # French target sentence

 >>> embed = Embedding(300)
 >>> encoder = Recurrence(LSTM(500), return_full_state=True)
 >>> decoder = RecurrenceFrom(LSTM(500))       # decoder starts from a data-dependent initial state, hence -From()
 >>> emit = Dense(30000)
 >>> h, c = encoder(embed(en)).outputs         # LSTM encoder has two outputs (h, c)
 >>> z = emit(decoder(h, c, sequence.past_value(fr)))   # decoder takes encoder outputs as initial state
 >>> loss = C.cross_entropy_with_softmax(z, fr)
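
To make the state-threading idea concrete, here is a framework-free sketch of greedy decoding that starts from a supplied initial state and feeds each step's state and prediction back in. None of this is CNTK API: toy_step, toy_scores, and greedy_decode are hypothetical stand-ins for a recurrent cell, an output layer, and the sampling loop, and the weights are arbitrary.

```python
import math


def toy_step(x, h, w_in=0.5, w_rec=0.9):
    """Hypothetical one-step recurrent cell: returns the new hidden state."""
    return math.tanh(w_in * x + w_rec * h)


def toy_scores(h, vocab_size=4):
    """Hypothetical output layer: one score per token id."""
    return [-(h * vocab_size - k) ** 2 for k in range(vocab_size)]


def greedy_decode(initial_state, start_token, max_len):
    """Greedy search: carry the state forward and feed back the argmax token."""
    h, token, sampled = initial_state, start_token, []
    for _ in range(max_len):
        h = toy_step(float(token), h)   # previous state goes back into the cell
        scores = toy_scores(h)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        sampled.append(token)
    return sampled


sampled_ids = greedy_decode(initial_state=0.1, start_token=1, max_len=5)
print(sampled_ids)
```

In an image captioning setting, initial_state would play the role of whatever the image encoder produces, which is the part a data-dependent initial state is meant to supply.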

Answer 2

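I recently ran into the same problem. Below is some sample code showing how to use Recurrence and RNNStep objects to obtain predictions. Note that the prediction is just the next-period state, since it is not transformed in any way. I used RNNStep because it has fewer parameters, which makes it easy to build a simple example. The same logic applies to LSTM.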

The first part of the example constructs a white noise process (it could be any process):

import numpy as np
import cntk

process_dim = 1
n_periods = 100
np.random.seed(0)
x_data = np.random.multivariate_normal(mean=np.array([0.0], dtype=np.float32),
                                       cov=np.matrix([[0.1]], dtype=np.float32),
                                       size=n_periods).astype(np.float32)

Then I build the RNN and set its parameters in a way that makes its internal computation easy to follow.

x = cntk.sequence.input_variable(shape=process_dim)
initial_state = 10
W = 1
H = 1
b = 1
with cntk.layers.default_options(initial_state=initial_state):
    model = cntk.layers.Recurrence(cntk.layers.RNNStep(shape=process_dim,
                                                       activation=lambda x_: x_,
                                                       init=W,
                                                       init_bias=b))(x)

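Note 1: The init parameter sets both H and W.
Note 2: If you do not use the with statement, the initial state is set to 0, which works just as well.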

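Let's first look at how to evaluate the RNN one period ahead, i.e. how the previous-period state (the initial state in our case) is combined with the current-period external input (one entry of x_data) to obtain the next state. The script below does this by calling model.eval() on some given inputs, and then validates the output by performing the actual RNN computation manually.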

x0 = np.array(x_data[0]).reshape((1, 1))
x1 = np.array(x_data[1]).reshape((1, 1))
x2 = np.array(x_data[2]).reshape((1, 1))

h00 = model.eval({x: x0})
h01 = model.eval({x: x1})
h02 = model.eval({x: x2})

print("x0: {}, h00: {}".format(x0, h00))
print("x1: {}, h01: {}".format(x1, h01))
print("x2: {}, h02: {}".format(x2, h02))

# validation
h00_m = initial_state * H + x0[0, 0] * W + b
h01_m = initial_state * H + x1[0, 0] * W + b
h02_m = initial_state * H + x2[0, 0] * W + b

print("validation of state computed manually")
print("x0: {}, h00_m: {}".format(x0[0, 0], h00_m))
print("x1: {}, h01_m: {}".format(x1[0, 0], h01_m))
print("x2: {}, h02_m: {}".format(x2[0, 0], h02_m))

Now let's see how to obtain multi-period-ahead forecasts using model.eval(). Again, the script below does this and validates the output.

# x0 now contains multiple entries
x0 = np.array(x_data[0:4]).reshape((4, 1))
h1234 = model.eval({x: x0})
print("forecasts obtained by model automatically")
print(h1234)

h01_m = initial_state * H + x0[0, 0] * W + b
h12_m = h01_m * H + x0[1, 0] * W + b
h23_m = h12_m * H + x0[2, 0] * W + b
h34_m = h23_m * H + x0[3, 0] * W + b

print("forecasts computed manually")
print(h01_m)
print(h12_m)
print(h23_m)
print(h34_m)

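Because RNNStep is used inside Recurrence, Recurrence automatically calls RNNStep once for each input you pass to the model function. Each time RNNStep is called, it receives the previous-period state and one entry of the array given to model.eval(). This is consistent with what happens during training.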
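
The unrolling described here can be mimicked in plain Python. This is a sketch of the idea only, not of CNTK internals: rnn_step and recurrence are hypothetical helpers that reuse the W = H = b = 1, identity-activation, initial_state = 10 setup from the example, and the four inputs are made-up values rather than entries of x_data.

```python
def rnn_step(h_prev, x_t, W=1.0, H=1.0, b=1.0):
    """Linear RNN step with identity activation: h_t = h_prev*H + x_t*W + b."""
    return h_prev * H + x_t * W + b


def recurrence(xs, initial_state=10.0):
    """Call rnn_step once per input, threading the state through the calls."""
    states, h = [], initial_state
    for x_t in xs:
        h = rnn_step(h, x_t)
        states.append(h)
    return states


states = recurrence([0.5, -0.25, 0.0, 1.0])
print(states)  # [11.5, 12.25, 13.25, 15.25]
```

Each returned state depends on the one before it, which is why evaluating on a 4-entry input yields four chained forecasts rather than four independent one-step results.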

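Here is some useful documentation on LSTM and RNN: https://docs.microsoft.com/en-us/python/api/cntk.layers.blocks?view=cntk-py-2.2 Its content is not exactly the same as https://cntk.ai/pythondocs/index.html, and in this case it is the more useful of the two.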

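Sorry I can't help with the CTF part of your question. I know very little about it, and I have also posted a question of my own that has yet to be answered: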

CNTK: How to initialize LSTM hidden state?
