Question

I'm trying to implement a skip-thought model using TensorFlow, and the current version is placed here.

Currently I'm using one GPU of my machine (2 GPUs in total), and the GPU info is

2017-09-06 11:29:32.657299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:02:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB

However, I get an OOM error when I try to feed data to the model. I tried to debug as follows:

I use the following snippet right after I run sess.run(tf.global_variables_initializer())

logger.info('Total: {} params'.format(
    np.sum([
        np.prod(v.get_shape().as_list())
        for v in tf.trainable_variables()
    ])))

and get 2017-09-06 11:29:51,333 INFO main main.py:127 - Total: 62968629 params, which is roughly 240 MB if everything uses tf.float32. The output of tf.global_variables is

[<tf.Variable 'embedding/embedding_matrix:0' shape=(155229, 200) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
 <tf.Variable 'decoder/weights:0' shape=(200, 155229) dtype=float32_ref>,
 <tf.Variable 'decoder/biases:0' shape=(155229,) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
 <tf.Variable 'global_step:0' shape=() dtype=int32_ref>]

In my training phase, I have a data array whose shape is (164652, 3, 30), namely sample_size x 3 x time_step; the 3 here stands for the previous sentence, the current sentence and the next sentence. This training data is about 57 MB in size and is stored in a loader. I then wrote a generator function to get the sentences, which looks like

def iter_batches(self, batch_size=128, time_major=True, shuffle=True):

    num_samples = len(self._sentences)
    if shuffle:
        samples = self._sentences[np.random.permutation(num_samples)]
    else:
        samples = self._sentences

    batch_start = 0
    while batch_start < num_samples:
        batch = samples[batch_start:batch_start + batch_size]

        lens = (batch != self._vocab[self._vocab.pad_token]).sum(axis=2)
        y, x, z = batch[:, 0, :], batch[:, 1, :], batch[:, 2, :]
        if time_major:
            yield (y.T, lens[:, 0]), (x.T, lens[:, 1]), (z.T, lens[:, 2])
        else:
            yield (y, lens[:, 0]), (x, lens[:, 1]), (z, lens[:, 2])
        batch_start += batch_size

The training loop looks like

for epoch in num_epochs:
    batches = loader.iter_batches(batch_size=args.batch_size)
    try:
        (y, y_lens), (x, x_lens), (z, z_lens) = next(batches)
        _, summaries, loss_val = sess.run(
            [train_op, train_summary_op, st.loss],
            feed_dict={
                st.inputs: x,
                st.sequence_length: x_lens,
                st.previous_targets: y,
                st.previous_target_lengths: y_lens,
                st.next_targets: z,
                st.next_target_lengths: z_lens
            })
    except StopIteration:
        ...

Then I get an OOM. If I comment out the whole try body (so that no data is fed), the script runs just fine.

I have no idea why I get an OOM at such a small data scale. Using nvidia-smi I always get

Wed Sep  6 12:03:37 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59                 Driver Version: 384.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   44C    P2    60W / 275W |  10623MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   43C    P2    62W / 275W |  10621MiB / 11171MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     32748      C   python3                                    10613MiB |
|    1     32748      C   python3                                    10611MiB |
+-----------------------------------------------------------------------------+

I can't see the actual GPU usage of my script, since TensorFlow always grabs all the memory at the beginning. The actual problem here is that I don't know how to debug this.

I've read some posts about OOM on StackOverflow. In most of them the error happened when a large test set was fed to the model, and feeding the data in small batches avoided the problem. But I don't see why such a small data and parameter combination fails on my 11 GB 1080 Ti, since according to the error it is only trying to allocate a matrix of size [3840 x 155229]. (This is the output matrix of the decoder: 3840 = 30 (time_steps) x 128 (batch_size), and 155229 is the vocab_size.)

Any help will be appreciated. Thanks in advance.

Answer 1

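Let's take the issues one by one:

About TensorFlow allocating all memory in advance: you can use the following snippet to let TensorFlow allocate memory only when it is needed, so that you can see how things are going.

gpu_options = tf.GPUOptions(allow_growth=True)
session = tf.InteractiveSession(config=tf.ConfigProto(gpu_options=gpu_options))

This works the same with tf.Session() instead of tf.InteractiveSession(), if you prefer.

About the second point, the sizes: since there is no information about your network size, we cannot estimate what is going wrong. However, you can debug the whole network step by step. For example, create a network with only one layer, get its output, create a session, feed values once, and check how much memory you consume. Repeat this debugging session until you reach the point where you run out of memory.

Please be aware that the 3840 x 155229 output is a really big output. It means ~600M values, which at 4 bytes per float32 value is ~2.22 GB for that one layer alone. If you have any layers of similar size, all of them add up and fill your GPU memory pretty fast.

Also, this is only for the forward pass. If you use this layer for training, backpropagation and the ops added by the optimizer will multiply this size by 2. So, for training, you consume ~5 GB for the output layer alone.

I suggest you revise your network and try to reduce the batch size or the parameter count so that your model fits on the GPU.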
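The figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming float32 values (4 bytes each) and using the shapes quoted in the question; the helper name `tensor_bytes` is made up for illustration, and the factor of 2 for training is the rough rule of thumb from this answer, not a measured value.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Shapes taken from the question.
time_steps, batch_size, vocab_size = 30, 128, 155229
rows = time_steps * batch_size  # 3840 rows in the decoder output

logits_bytes = tensor_bytes((rows, vocab_size))
logits_gib = logits_bytes / 2**30

# Rough rule of thumb from the answer: training about doubles the footprint.
training_gib = 2 * logits_gib

print("elements:       {:,}".format(rows * vocab_size))
print("forward pass:   {:.2f} GiB".format(logits_gib))
print("training (~2x): {:.2f} GiB".format(training_gib))
```

Note that the variable list in the question shows two decoders (previous and next), so if each produces an output of this shape the total could be higher still.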

Answer 2

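This may not make sense technically, but after experimenting for some time, this is what I have found out.

ENVIRONMENT: Ubuntu 16.04

When you run the command

nvidia-smi

you get the total memory consumption of the installed Nvidia graphics card. An example is shown in this image:

[image: nvidia-smi output]

When you have run your neural network, your consumption may change to look something like this:

[image: nvidia-smi output while the network is running]

The memory consumption is generally attributed to python. For some strange reason, if this process fails to terminate successfully, the memory is never freed. If you then try to run another instance of the neural network application, you will get a memory allocation error. The hard way is to figure out how to terminate this process using its process ID. The easy way is to just restart your computer and try again. If it is a code-related bug, however, this will not work.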
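For the "hard way", one possible approach is to ask nvidia-smi which processes hold GPU memory and then terminate the stale one by PID. The sketch below assumes the `--query-compute-apps` query mode of nvidia-smi is available with your driver. The function names are hypothetical, and the sample text is built from the numbers in the question rather than captured from a real run.

```python
import os
import signal
import subprocess

# Assumed nvidia-smi query: one CSV line per process, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' lines into a list of dicts."""
    apps = []
    for line in csv_text.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue  # skip lines that do not match the expected layout
        pid, name, used = parts
        if not (pid.isdigit() and used.isdigit()):
            continue  # skip entries such as "[N/A]"
        apps.append({"pid": int(pid), "name": name, "used_mib": int(used)})
    return apps


def list_gpu_processes():
    """Run nvidia-smi and return the processes currently holding GPU memory."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_compute_apps(output)


def terminate(pid):
    """Ask a process to exit; only use this on a process you own."""
    os.kill(pid, signal.SIGTERM)


# Illustrative sample only: one process holding memory on both GPUs.
sample = "32748, python3, 10613\n32748, python3, 10611\n"
for app in parse_compute_apps(sample):
    print(app)
```

On a real machine you would call `list_gpu_processes()`, check that the PID really belongs to a leftover run, and only then call `terminate(pid)`.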

Answer 3

You are running out of memory. You can reduce the batch size; this will slow down training, but it lets you fit the data.
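To see how much a smaller batch helps, here is a rough sketch that recomputes the size of the decoder output from the question for several batch sizes. It assumes float32 values and the 30 time steps and 155229-word vocabulary stated in the question; the function name `logits_gib` is made up for illustration, and the figures cover only this one tensor, not the whole model.

```python
def logits_gib(batch_size, time_steps=30, vocab_size=155229, bytes_per_element=4):
    """Approximate size in GiB of a (time_steps * batch_size, vocab_size) tensor."""
    return batch_size * time_steps * vocab_size * bytes_per_element / 2**30


# Halving the batch size halves the size of this tensor.
for batch_size in (128, 64, 32, 16):
    print("batch_size={:>3}: {:.2f} GiB".format(batch_size, logits_gib(batch_size)))
```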

Answer 4

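I am aware that your question is about tensorflow. Anyway, using Keras with a tensorflow backend was my use case, and it caused the same OOM error.

My solution, using Keras with the tf backend, was to use Keras' fit_generator() method. Prior to that I had only used the fit() method (which led to the OOM error).

fit_generator() is normally useful if you can't fit your data into main memory, or if you have to use CPU resources in parallel with the GPU training. For example, see this extract from the docs:

The generator is run in parallel to the model, for efficiency. For instance, this allows you to do real-time data augmentation on images on CPU in parallel to training your model on GPU.

Apparently, this can also help to prevent memory overflows in the graphics card's memory.

Edit: If you need some inspiration on how to develop your own (thread-safe) generator extending Keras' Sequence class (which can then be used with fit_generator()), you could have a look at some information I provided in this Q&A.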
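As a rough illustration of the batch-by-batch idea, the sketch below mimics the interface Keras expects from a Sequence: `__len__` returns the number of batches and `__getitem__` returns one batch. It is written in plain Python so that it runs without Keras. The class name `BatchSequence` and the toy data are hypothetical; in real use the class would subclass `keras.utils.Sequence` and return arrays rather than lists.

```python
import math


class BatchSequence:
    """Toy stand-in for a Keras Sequence: serves the data one batch at a time."""

    def __init__(self, inputs, targets, batch_size):
        self.inputs = inputs
        self.targets = targets
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.inputs) / self.batch_size)

    def __getitem__(self, index):
        # Slice out only the samples that belong to batch `index`.
        start = index * self.batch_size
        end = start + self.batch_size
        return self.inputs[start:end], self.targets[start:end]


# Toy data: 10 samples served in batches of 4.
inputs = list(range(10))
targets = [value * 2 for value in inputs]
sequence = BatchSequence(inputs, targets, batch_size=4)

for index in range(len(sequence)):
    batch_inputs, batch_targets = sequence[index]
    print(index, batch_inputs, batch_targets)
```

Only one batch is materialized per step, which is the property this answer relies on.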

Answer 5

Restarting the Jupyter notebook worked for me.

Understanding the ResourceExhaustedError: OOM when allocating tensor with shape

5 answers: