Question

I am using pretrained GloVe embeddings to train my own network. I use

self.embedding = tf.get_variable(name="embedding", shape=self.id2vec_table.shape, initializer=tf.constant_initializer(self.id2vec_table), trainable=False)

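and

tuning_embedding = tf.nn.embedding_lookup(self.embedding, self.txt_from_mfcc)

to initialize and look up the embeddings. However, when I train, I get the following error (the full message is too long, so I include only the part I think is most important):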

Sum Total of in-use chunks: 3.85GiB
Limit: 11281927373
InUse: 4131524096
MaxInUse: 6826330624
NumAllocs: 47061
MaxAllocSize: 2842165248
OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[4800,400001] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

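However, judging from the stats in the error, my Tesla K80 has about 11 GB of memory, and only 40% to 70% of it (roughly 4 to 7 GB) is in use. How can my GPU run out of memory when at most 70% of the total is being used? I just can't understand how this works.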
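As a rough sanity check on these numbers, the tensor named in the error can be sized from its shape and compared with the gap between the reported limit and the memory in use. This is only a sketch: it assumes that "type float" means 4-byte float32, and it ignores fragmentation and any other allocations made at the same moment.

```python
# Values copied from the error message in the question.
limit_bytes = 11281927373    # "Limit"
in_use_bytes = 4131524096    # "InUse"
shape = (4800, 400001)       # shape of the tensor that could not be allocated

BYTES_PER_FLOAT32 = 4        # assumption: "type float" means float32

# Size of the single tensor requested by the matmul.
tensor_bytes = shape[0] * shape[1] * BYTES_PER_FLOAT32

# Memory that is nominally still available under the allocator's limit.
free_bytes = limit_bytes - in_use_bytes

GIB = 1024 ** 3
print(f"requested tensor: {tensor_bytes / GIB:.2f} GiB")
print(f"limit - in use:   {free_bytes / GIB:.2f} GiB")
print("fits:", tensor_bytes <= free_bytes)
```

Under that assumption the single request is about 7.15 GiB, while only about 6.66 GiB remains under the limit, which would be consistent with an OOM even though current usage is well below 100%.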

I have also tried the approaches from other posts, such as

https://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu

and limited my batch size to 16, or set config.gpu_options.allow_growth = True, or config.gpu_options.allocator_type = 'BFC', or config.gpu_options.per_process_gpu_memory_fraction = 0.4, but the error persists.
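For what it is worth, per_process_gpu_memory_fraction limits how much GPU memory the process may use; it does not change the size of the tensor being requested. The sketch below estimates the cap that 0.4 would impose. It uses the "Limit" value from the error as a stand-in for the memory available to TensorFlow (an assumption, since the real cap is derived from the device's total memory) and again assumes 4-byte floats.

```python
# Assumption: the "Limit" reported in the error approximates the memory
# TensorFlow could use on this GPU without any fraction setting.
reported_limit_bytes = 11281927373
memory_fraction = 0.4            # per_process_gpu_memory_fraction from the question

# Tensor from the error message, assuming 4 bytes per float32 element.
shape = (4800, 400001)
tensor_bytes = shape[0] * shape[1] * 4

# Approximate upper bound on GPU memory for the process under the fraction.
cap_bytes = int(reported_limit_bytes * memory_fraction)

GIB = 1024 ** 3
print(f"approximate cap:  {cap_bytes / GIB:.2f} GiB")
print(f"requested tensor: {tensor_bytes / GIB:.2f} GiB")
print("tensor fits under cap:", tensor_bytes <= cap_bytes)
```

If these assumptions hold, a cap of roughly 4.2 GiB is smaller than the tensor itself, so lowering the fraction would not be expected to help with this particular allocation.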

Can anyone help with this?

Understanding TensorFlow's OOM mechanism

0 answers: