Why does GPU memory usage differ between training from scratch and fine-tuning?

Time: 2017-01-09 02:14:27

Tags: tensorflow

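I train on my own data in two ways: the first trains the model from scratch, and the second fine-tunes a pretrained model (following https://github.com/tensorflow/models/tree/master/slim). All parameters are the same for both methods except the checkpoint-related settings, yet the first method always runs out of GPU memory. When I reduce the batch size for the first method, it runs fine. What is the reason?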

More information is available at: https://github.com/tensorflow/models/issues/848

1. Training from scratch
python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=my_data \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_resnet_v2 \
--max_number_of_steps=500000 \
--batch_size=48 \
--num_readers=16 \
--learning_rate=0.1 \
--learning_rate_decay_type=exponential \
--num_epochs_per_decay=4.0 \
--learning_rate_decay_factor=0.9 \
--save_interval_secs=6000 \
--save_summaries_secs=1000 \
--log_every_n_steps=100 \
--optimizer=adam \
--opt_epsilon=1e-1 \
--weight_decay=0.0004 \
--num_clones=6
The out-of-memory log looks like this:
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 1, Chunks in use: 0 768B allocated for chunks. 768B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 57.57MiB was 32.00MiB, Chunk State:
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0000 of size 1280
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0800 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0900 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0b00 of size 256
................................
...............................
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 265531392 totalling 506.46MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 10.56GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 11343019213
InUse: 11336882944
MaxInUse: 11340513536
NumAllocs: 11965
MaxAllocSize: 2632187904

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 24.38MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[48,8,8,2080]
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 24.38MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[48,8,8,2080]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[48,17,17,1088]
[[Node: clone_0/InceptionResnetV2/Repeat_1/block17_17/Conv2d_1x1/BiasAdd = BiasAdd[T=DT_FLOAT, data_format="NHWC", _device="/job:localhost/replica:0/task:0/gpu:0"](clone_0/InceptionResnetV2/Repeat_1/block17_17/Conv2d_1x1/convolution, InceptionResnetV2/Repeat_1/block17_17/Conv2d_1x1/biases/read)]]
[[Node: train_op/_24881 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_188818_train_op", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
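The sizes in the OOM messages above can be checked by hand from the tensor shapes. The following is a minimal sketch, assuming dense float32 tensors (the log reports `T=DT_FLOAT`); the helper name `tensor_mib` is made up for illustration. It also compares the allocator's `Limit` and `InUse` figures to show how little headroom was left when the allocation failed.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4  # assumption: DT_FLOAT tensors use 4 bytes per element
MIB = 1024 ** 2


def tensor_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size in MiB of a dense tensor with the given shape."""
    num_elements = reduce(mul, shape, 1)
    return num_elements * bytes_per_element / MIB


# Shapes copied from the OOM messages in the log.
for shape in ([48, 8, 8, 2080], [48, 17, 17, 1088]):
    print(shape, "%.2f MiB" % tensor_mib(shape))

# Allocator statistics copied from the log (bytes).
limit_bytes = 11343019213
in_use_bytes = 11336882944
free_mib = (limit_bytes - in_use_bytes) / MIB
print("free at failure: %.2f MiB" % free_mib)
```

The first shape works out to the 24.38 MiB request named in the warning and the second to the 57.57 MiB bin in the chunk dump, while only a few MiB remained free. The leading dimension of 48 is the `--batch_size` value, so every such activation scales linearly with the batch size.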



2. Fine-tuning from a pretrained checkpoint
    The command is:
    python train_image_classifier.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_name=my_data \
    --dataset_split_name=train \
    --dataset_dir=${DATASET_DIR} \
    --model_name=inception_resnet_v2 \
    --checkpoint_path=${PRETRAINED_CHECKPOINT_DIR}/inception_resnet_v2_2016_08_30.ckpt \
    --checkpoint_exclude_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits \
    --trainable_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits \
    --max_number_of_steps=500000 \
    --batch_size=48 \
    --num_readers=16 \
    --learning_rate=0.1 \
    --learning_rate_decay_type=exponential \
    --num_epochs_per_decay=4.0 \
    --learning_rate_decay_factor=0.9 \
    --save_interval_secs=6000 \
    --save_summaries_secs=1000 \
    --log_every_n_steps=100 \
    --optimizer=adam \
    --opt_epsilon=1e-1 \
    --weight_decay=0.0004 \
    --num_clones=6
    The log output is as follows:
    W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xcad55b0
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 4 with properties:
    name: Tesla K40m
    major: 3 minor: 5 memoryClockRate (GHz) 0.745
    pciBusID 0000:30:00.0
    Total memory: 11.25GiB
    Free memory: 11.12GiB
    W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x3a765740
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 5 with properties:
    name: Tesla K40m
    major: 3 minor: 5 memoryClockRate (GHz) 0.745
    pciBusID 0000:33:00.0
    Total memory: 11.25GiB
    Free memory: 11.12GiB
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 4
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 5
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 4
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 5
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 4
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 5
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 4
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 5
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 0
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 1
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 2
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 3
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 0
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 1
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 2
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 3
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3 4 5
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y Y Y N N
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y Y Y N N
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: Y Y Y Y N N
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: Y Y Y Y N N
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 4: N N N N Y Y
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 5: N N N N Y Y
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:09:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40m, pci bus id: 0000:0a:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K40m, pci bus id: 0000:0d:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K40m, pci bus id: 0000:0e:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla K40m, pci bus id: 0000:30:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla K40m, pci bus id: 0000:33:00.0)
    INFO:tensorflow:Starting Session.
    INFO:tensorflow:Starting Queues.
    INFO:tensorflow:global_step/sec: 0
    INFO:tensorflow:global step 100: loss = 19.9046 (2.07 sec/step)
    INFO:tensorflow:global step 200: loss = 19.8159 (2.64 sec/step)
    INFO:tensorflow:global step 300: loss = 19.7198 (2.82 sec/step)

    Training runs successfully.
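Besides the checkpoint flags, the fine-tuning command also passes `--trainable_scopes`, which the from-scratch command does not. Assuming the script restricts the optimizer to variables under those scopes, gradients and Adam slot variables would exist only for the Logits and AuxLogits layers, and the backward pass would not need to run through the rest of the network. That is a plausible reason for the lower memory use and the shorter step time, though it is not confirmed by the logs. The sketch below illustrates the filtering idea with plain prefix matching; `select_trainable` and the variable names other than the `block17_17` one from the log are hypothetical.

```python
def select_trainable(variable_names, trainable_scopes):
    """Keep only the variables whose name starts with one of the scopes.

    trainable_scopes is a comma-separated string, as on the command line.
    Prefix matching is a simplification chosen for this sketch.
    """
    scopes = [scope.strip() for scope in trainable_scopes.split(",")]
    return [name for name in variable_names
            if any(name.startswith(scope) for scope in scopes)]


# Illustrative variable names (hypothetical apart from the block17_17 bias,
# which appears in the OOM log of the from-scratch run).
all_variables = [
    "InceptionResnetV2/Conv2d_1a_3x3/weights",
    "InceptionResnetV2/Repeat_1/block17_17/Conv2d_1x1/biases",
    "InceptionResnetV2/Logits/Logits/weights",
    "InceptionResnetV2/AuxLogits/Logits/weights",
]

trainable = select_trainable(
    all_variables,
    "InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits",
)
print("trainable:", trainable)
print("frozen:", [v for v in all_variables if v not in trainable])
```

To compare the two setups on equal terms, the from-scratch run would need the same set of trainable variables, which is not the case here.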
3. Decreasing the batch_size of the first method
If batch_size is decreased to 28, training runs successfully. However, another problem appears: each step takes longer (3.48 sec/step) than with the fine-tuning method (2.07 sec/step).
The command is:
python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=my_data \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_resnet_v2 \
--max_number_of_steps=500000 \
--batch_size=28 \
--num_readers=16 \
--learning_rate=0.1 \
--learning_rate_decay_type=exponential \
--num_epochs_per_decay=4.0 \
--learning_rate_decay_factor=0.9 \
--save_interval_secs=6000 \
--save_summaries_secs=1000 \
--log_every_n_steps=100 \
--optimizer=adam \
--opt_epsilon=1e-1 \
--weight_decay=0.0004 \
--num_clones=6

The log output:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3 4 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 4: N N N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 5: N N N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40m, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K40m, pci bus id: 0000:0d:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K40m, pci bus id: 0000:0e:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla K40m, pci bus id: 0000:30:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla K40m, pci bus id: 0000:33:00.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1902 get requests, put_count=1100 evicted_count=1000 eviction_rate=0.909091 and unsatisfied allocation rate=1
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.29GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.57GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
INFO:tensorflow:global step 100: loss = 26.2026 (3.49 sec/step)
INFO:tensorflow:global step 200: loss = 25.8227 (3.47 sec/step)
INFO:tensorflow:global_step/sec: 0.266924
INFO:tensorflow:global step 300: loss = 25.2874 (3.48 sec/step)
INFO:tensorflow:global step 400: loss = 24.7210 (3.47 sec/step)
INFO:tensorflow:global step 500: loss = 24.2435 (3.47 sec/step)
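Step times alone understate the gap between the two runs, because the batch sizes differ. A rough throughput comparison, assuming `--batch_size` is applied per clone (the OOM tensor on `clone_0` has a leading dimension of 48, which suggests this) and using the step times reported in the logs:

```python
def images_per_second(batch_size, num_clones, sec_per_step):
    """Approximate throughput, assuming each clone processes batch_size images per step."""
    return batch_size * num_clones / sec_per_step


# (batch_size, num_clones, sec/step) taken from the commands and logs.
runs = {
    "from scratch, batch_size=28": (28, 6, 3.48),
    "fine-tuning, batch_size=48": (48, 6, 2.07),
}

for label, (batch_size, num_clones, sec_per_step) in runs.items():
    rate = images_per_second(batch_size, num_clones, sec_per_step)
    print("%s: %.1f images/sec" % (label, rate))
```

The "Ran out of memory trying to allocate 2.29GiB ... not a failure" warnings in the batch_size=28 run indicate that it is still memory-constrained, which may contribute to the slower steps.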

1 answer:

Answer 0 (score: 0)

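The batch size is the number of training examples loaded onto the GPU at once for each training step. Reducing the batch size reduces the memory footprint of each step. For the best training throughput, you may want to find the "sweet spot" for your training set and hardware: the largest batch size that does not run into OOM.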
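One way to look for that sweet spot is a binary search over batch sizes. This is a minimal sketch, assuming that if a batch size fits in memory then every smaller one does too. The `fits` callback is hypothetical: in practice it would launch a short trial run and report whether it hit an OOM error. The `toy_fits` stand-in below uses made-up memory numbers purely so the example runs.

```python
def largest_fitting_batch_size(fits, low=1, high=256):
    """Return the largest batch size in [low, high] for which fits() is True.

    Returns None if even `low` does not fit. Assumes fits() is monotone:
    once a batch size is too large, all larger ones are too.
    """
    best = None
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best = mid          # mid works; try something larger
            low = mid + 1
        else:
            high = mid - 1      # mid is too large; try something smaller
    return best


def toy_fits(batch_size, limit_mib=11000, fixed_mib=1000, per_example_mib=250):
    """Toy memory model with made-up numbers: fixed cost plus a per-example cost."""
    return fixed_mib + per_example_mib * batch_size <= limit_mib


print(largest_fitting_batch_size(toy_fits))
```

Leaving some headroom below the value found this way is usually sensible, since memory use can vary slightly between steps.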