
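I am running train_image_classifier.py (https://github.com/tensorflow/models/tree/master/research/slim/train_image_classifier.py) with the following setup.

Ubuntu 18.04, a GTX 960 GPU card with 2 GB of memory, dataset "cifar10", model "mobilenet_v2". Inside my Docker container, the GPU memory usage is as shown in the attached screenshot.

I have run into the following problems.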

1) With the default batch_size (32, I think), this error is reported:

"2019-03-04 01:39:19.606917: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats: Limit: 1244528640 InUse: 1235437312 MaxInUse: 1235448064 NumAllocs: 1323 MaxAllocSize: 485858304

2019-03-04 01:39:19.606936: W tensorflow/core/common_runtime/bfc_allocator.cc:271] *******************************************x**************xx**************xxxxxxxx*************xxxxx

2019-03-04 01:39:19.878941: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at conv_ops.cc:746 : Resource exhausted: OOM when allocating tensor with shape[32,192,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"

2) When I set batch_size = 16, NaN values occur.

3) Following the guide (https://github.com/tensorflow/models/tree/master/research/inception#adjusting-memory-demands), I lowered input_queue_memory_factor to 8, 4, 2, and 1; each time the same error as in 1) was reported.
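
As a sanity check on the log in 1), the failing allocation can be compared with what the allocator had left. This is only a rough sketch: it assumes "type float" means 4-byte float32, and the helper name `tensor_bytes` is made up for illustration. It sizes a single tensor and does not estimate the total memory the model needs.

```python
# Values copied from the bfc_allocator "Stats" line in the log.
LIMIT_BYTES = 1244528640
IN_USE_BYTES = 1235437312

# Assumption: "type float" in the OOM message means 32-bit floats.
BYTES_PER_FLOAT32 = 4


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Shape taken from the OOM message: [batch, channels, height, width].
requested = tensor_bytes([32, 192, 28, 28])
free = LIMIT_BYTES - IN_USE_BYTES

print("requested bytes:", requested)
print("free bytes under the allocator limit:", free)
print("request fits:", requested <= free)
```

By this arithmetic the tensor needs 19,267,584 bytes while only 9,091,328 bytes remain under the allocator limit of about 1.16 GiB, so the allocation cannot succeed. The first dimension is the batch size, so this particular tensor shrinks in proportion when batch_size is reduced.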

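Can you give me any further guidance? For example, should I try a different application?

If there is no solution other than buying another GPU card with more memory, I am willing to pay for one. But how do I estimate how much GPU memory I need?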

Thanks.

[Attached screenshot: GPU memory usage]

TensorFlow's "train_image_classifier.py": resource exhausted on a GTX 960 2 GB GPU card
