Question

I am having a problem running a Keras model on a Google Cloud Platform instance.
The model is as follows:

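import tensorflow as tf
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense, LeakyReLU, RepeatVector, TimeDistributed
from keras.utils import multi_gpu_model

n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]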

train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))

verbose, epochs, batch_size = 1, 1, 64  # low number of epochs just for testing purpose
with tf.device('/cpu:0'):
    m = Sequential()
    m.add(CuDNNLSTM(20, input_shape=(n_timesteps, n_features)))
    m.add(LeakyReLU(alpha=0.1))
    m.add(RepeatVector(n_outputs))
    m.add(CuDNNLSTM(20, return_sequences=True))
    m.add(LeakyReLU(alpha=0.1))
    m.add(TimeDistributed(Dense(20)))
    m.add(LeakyReLU(alpha=0.1))
    m.add(TimeDistributed(Dense(1)))

self.model = multi_gpu_model(m, gpus=8)
self.model.compile(loss='mse', optimizer='adam')

self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)

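As the code above shows, I am running the model on a machine with 8 GPUs (Nvidia Tesla K80).
Training runs fine without any errors. However, prediction fails and returns the following error: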

W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'

Here is the code that runs the prediction:

self.model.predict(input_x)

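What I noticed is that if I remove the multi-GPU data-parallelism code, the code works fine on a single GPU.
More precisely, if I comment out this line, the code works without errors: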

self.model = multi_gpu_model(m, gpus=8)

What am I missing?

Virtual environment info

cudatoolkit-10.0.130
cudnn-7.6.4
keras-2.2.4
keras-applications-1.0.8
keras-base-2.2.4
keras-gpu-2.2.4
python-3.6

Update

train_x.shape = (1441, 288, 1)
train_y.shape = (1441, 288, 1)
input_x.shape = (1, 288, 1)

Following Olivier Dehaene's reply, I tried his suggestion, and it worked.
I modified the shape of input_x to (8, 288, 1).
To do that, I also modified the shapes of train_x and train_y.
To recap:

train_x.shape = (8065, 288, 1)
train_y.shape = (8065, 288, 1)
input_x.shape = (8, 288, 1)

But now I get the same error during the training phase, on this line:

self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)

Answer 1

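From the tf.keras.utils.multi_gpu_model documentation, we can see that it works in the following way:

Divide the model's input(s) into multiple sub-batches.

Apply a model copy on each sub-batch. Every model copy is executed on a dedicated GPU.

Concatenate the results (on CPU) into one big batch.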

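You are triggering the error because the input to the CuDNNLSTM layer of at least one model copy is empty. This happens because splitting the input into sub-batches requires that the number of samples satisfy: n_samples // n_gpus > 0. With input_x.shape = (1, 288, 1) and gpus=8, 1 // 8 = 0.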
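To make the splitting concrete, here is a minimal sketch in plain Python of how a batch could be divided across replicas. The function name replica_batch_sizes is hypothetical, and the sketch assumes each replica gets batch_size // n_gpus samples with the last replica taking whatever is left over; it only computes sizes and does not call Keras.

```python
def replica_batch_sizes(batch_size, n_gpus):
    """Return the number of samples each model copy would receive.

    Assumption: every replica gets batch_size // n_gpus samples and the
    last replica additionally takes the remainder.
    """
    step = batch_size // n_gpus
    sizes = [step] * (n_gpus - 1)
    sizes.append(batch_size - step * (n_gpus - 1))
    return sizes


# A single sample on 8 GPUs leaves seven replicas with an empty input.
print(replica_batch_sizes(1, 8))  # [0, 0, 0, 0, 0, 0, 0, 1]

# Eight samples on 8 GPUs give every replica exactly one sample.
print(replica_batch_sizes(8, 8))  # [1, 1, 1, 1, 1, 1, 1, 1]
```

Under this assumption, any batch with fewer samples than GPUs leaves at least one replica with nothing to process.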

Try the following code:

input_x = np.random.randn(8, n_timesteps, n_features)
model.predict(input_x)
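The same reasoning may explain the training failure described in the question's update. When fit is given NumPy arrays, the final batch of an epoch holds only the leftover samples, and 8065 samples with batch_size=64 leave a final batch of a single sample. The sketch below checks for that case; the helper names are hypothetical, and it assumes a replica ends up empty whenever a batch has fewer samples than there are GPUs.

```python
def last_batch_size(n_samples, batch_size):
    """Size of the final batch in one epoch over n_samples."""
    remainder = n_samples % batch_size
    return remainder if remainder else batch_size


def has_empty_replica(n_samples, batch_size, n_gpus):
    """True if the final batch is too small to give every GPU a sample.

    Assumption: a batch with fewer samples than GPUs leaves at least
    one model copy with an empty input.
    """
    return last_batch_size(n_samples, batch_size) < n_gpus


# Original training set: the final batch has 33 samples, enough for 8 GPUs.
print(last_batch_size(1441, 64), has_empty_replica(1441, 64, 8))  # 33 False

# Updated training set: the final batch has 1 sample, too few for 8 GPUs.
print(last_batch_size(8065, 64), has_empty_replica(8065, 64, 8))  # 1 True
```

If this is indeed the cause, choosing a number of training samples (or a batch_size) whose final batch still has at least n_gpus samples, for example 8064 samples with batch_size=64, should avoid the empty sub-batch.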
