我在本地计算机上运行tensorflow 1.7,其中包含2个GPU,每个大约8 GB。
训练(train.py)当我使用模型&fast;更快的时候,对象检测工作正常。[fast_rcnn_resnet101_coco'但是,当我试图跑出来时,fast_rcnn_nas_coco'它显示出资源耗尽的错误'
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-05-02 16:14:53.963966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0, 1
2018-05-02 16:14:53.964071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-02 16:14:53.964083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0 1
2018-05-02 16:14:53.964091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N Y
2018-05-02 16:14:53.964097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 1: Y N
2018-05-02 16:14:53.964566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7385 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-05-02 16:14:53.966360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7552 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:03:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from training/model.ckpt-0
INFO:tensorflow:Restoring parameters from training/model.ckpt-0
Limit: 7744048333
InUse: 7699536896
MaxInUse: 7699551744
NumAllocs: 10260
MaxAllocSize: 4076716032
2018-05-02 16:16:52.223943: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ***********************************************************************************x****************
2018-05-02 16:16:52.223967: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at depthwise_conv_op.cc:358 : Resource exhausted: OOM when allocating tensor with shape[64,672,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
我不确定!!如果它同时使用GPU,因为它在使用内存中显示为' 7699536896'。经过train.py后,我也尝试了
python train.py \
--logtostderr \
--worker_replicas=2 \
--pipeline_config_path=training/faster_rcnn_resnet101_coco.config \
--train_dir=training
如果有2个GPU可用,默认情况下tensorflow会选择它们吗?还是需要任何论据?
答案 0 :(得分:0)
我们使用{{1}}指定的GPU数量。对于NASNet案例,请尝试减小批量大小以使网络适合GPU。