RuntimeError: CUDA out of memory

Time: 2021-02-17 14:24:07

Tags: pytorch

I am getting this error:

RuntimeError: CUDA out of memory

GPU 0; 1.95 GiB total capacity; 1.23 GiB already allocated; 1.27 GiB reserved in total by PyTorch

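But this does not look like a real out-of-memory condition; it seems to me that PyTorch is allocating the wrong amount of memory. I changed the batch size to 1, killed every application that was using GPU memory, and rebooted, but nothing helped.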
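
One possible reason a smaller batch size changes little: the weights, their gradients, and the optimizer state cost the same memory regardless of batch size. The sketch below is only a rough estimate. It takes the parameter count from the "Model Summary" line in the log further down and assumes float32 storage and SGD with one momentum buffer per parameter (the log shows `adam=False`). The helper name `static_training_mib` is made up for illustration.

```python
# Rough estimate of the batch-size-independent training footprint.
# Assumptions (not confirmed by the thread): float32 tensors, and three
# copies per parameter (weights + gradients + one SGD momentum buffer).
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2


def static_training_mib(num_params, copies=3):
    """Return the memory in MiB taken by `copies` float32 values per parameter."""
    return num_params * BYTES_PER_FLOAT32 * copies / MIB


num_params = 5.24994e7    # "Model Summary" line in the training log
total_mib = 1.95 * 1024   # "1.95 GiB total capacity" in the error message

static_mib = static_training_mib(num_params)
print(f"static footprint: {static_mib:.0f} MiB of {total_mib:.0f} MiB")
```

Under these assumptions roughly 600 MiB of the card is taken before any activation is stored, which leaves limited headroom on a 2 GB GPU even at batch size 1.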

This is how I run it. Please tell me what information is needed to fix this, or where I should look. Thanks.

    python train.py --img 416 --batch 16 --epochs 1 \
      --data '../data.yaml' --cfg ./models/yolov4-csp.yaml \
      --weights '' --name yolov4-csp-results --cache
    Using CUDA device0 _CudaDeviceProperties(name='Quadro P620', total_memory=2000MB)
    
    Namespace(adam=False, batch_size=16, bucket='', cache_images=True, cfg='./models/yolov4-csp.yaml', data='../data.yaml', device='', epochs=1, evolve=False, global_rank=-1, hyp='data/hyp.scratch.yaml', img_size=[416, 416], local_rank=-1, logdir='runs/', multi_scale=False, name='yolov4-csp-results', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, sync_bn=False, total_batch_size=16, weights='', world_size=1)
    Start Tensorboard with "tensorboard --logdir runs/", view at http://localhost:6006/
    Hyperparameters {'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.5, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mixup': 0.0}
    Overriding ./models/yolov4-csp.yaml nc=80 with nc=1
    
                     from  n    params  module                                  arguments
      0                -1  1       928  models.common.Conv                      [3, 32, 3, 1]
      1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
      2                -1  1     20672  models.common.Bottleneck                [64, 64]
      3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
      4                -1  1    119936  models.common.BottleneckCSP             [128, 128, 2]
      5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
      6                -1  1   1463552  models.common.BottleneckCSP             [256, 256, 8]
      7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
      8                -1  1   5843456  models.common.BottleneckCSP             [512, 512, 8]
      9                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 2]
     10                -1  1  12858368  models.common.BottleneckCSP             [1024, 1024, 4]
     11                -1  1   7610368  models.common.SPPCSP                    [1024, 512, 1]
     12                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
     13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
     14                 8  1    131584  models.common.Conv                      [512, 256, 1, 1]
     15          [-1, -2]  1         0  models.common.Concat                    [1]
     16                -1  1   1642496  models.common.BottleneckCSP2            [512, 256, 2]
     17                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
     18                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
     19                 6  1     33024  models.common.Conv                      [256, 128, 1, 1]
     20          [-1, -2]  1         0  models.common.Concat                    [1]
     21                -1  1    411648  models.common.BottleneckCSP2            [256, 128, 2]
     22                -1  1    295424  models.common.Conv                      [128, 256, 3, 1]
     23                -2  1    295424  models.common.Conv                      [128, 256, 3, 2]
     24          [-1, 16]  1         0  models.common.Concat                    [1]
     25                -1  1   1642496  models.common.BottleneckCSP2            [512, 256, 2]
     26                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]
     27                -2  1   1180672  models.common.Conv                      [256, 512, 3, 2]
     28          [-1, 11]  1         0  models.common.Concat                    [1]
     29                -1  1   6561792  models.common.BottleneckCSP2            [1024, 512, 2]
     30                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 1]
     31      [22, 26, 30]  1     32310  models.yolo.Detect                      [1, [[12, 16, 19, 36, 40, 28], [36, 75, 76, 55, 72, 146], [142, 110, 192, 243, 459, 401]], [256, 512, 1024]]
    Model Summary: 334 layers, 5.24994e+07 parameters, 5.24994e+07 gradients
    
    Optimizer groups: 111 .bias, 115 conv.weight, 108 other
    Scanning labels ../train/labels.cache (78 found, 0 missing, 0 empty, 0 duplicate, for 78 images): 100%|█| 78/78 [00:00<0
    Caching images (0.0GB): 100%|██████████████████████████████████████████████████████████| 78/78 [00:00<00:00, 305.27it/s]
    Scanning labels ../valid/labels.cache (15 found, 0 missing, 0 empty, 0 duplicate, for 15 images): 100%|█| 15/15 [00:00<0
    Caching images (0.0GB): 100%|██████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 333.01it/s]
    
    Analyzing anchors... anchors/target = 4.64, Best Possible Recall (BPR) = 1.0000
    Image sizes 416 train, 416 test
    Using 8 dataloader workers
    Starting training for 1 epochs...
    
         Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
      0%|                                                                                             | 0/5 [00:04<?, ?it/s]
    Traceback (most recent call last):
      File "train.py", line 443, in <module>
        train(hyp, opt, device, tb_writer)
      File "train.py", line 256, in train
        pred = model(imgs)
      File "/home/ctdi/anaconda3/envs/scaled-yolov4.03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/ctdi/content/ScaledYOLOv4/models/yolo.py", line 109, in forward
        return self.forward_once(x, profile)  # single-scale inference, train
      File "/home/ctdi/content/ScaledYOLOv4/models/yolo.py", line 129, in forward_once
        x = m(x)  # run
      File "/home/ctdi/anaconda3/envs/scaled-yolov4.03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/ctdi/content/ScaledYOLOv4/models/common.py", line 47, in forward
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
      File "/home/ctdi/anaconda3/envs/scaled-yolov4.03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/ctdi/content/ScaledYOLOv4/models/common.py", line 31, in forward
        return self.act(self.bn(self.conv(x)))
      File "/home/ctdi/anaconda3/envs/scaled-yolov4.03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/ctdi/anaconda3/envs/scaled-yolov4.03/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
        self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
      File "/home/ctdi/anaconda3/envs/scaled-yolov4.03/lib/python3.6/site-packages/torch/nn/functional.py", line 2059, in batch_norm
        training, momentum, eps, torch.backends.cudnn.enabled
    RuntimeError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 1.95 GiB total capacity; 1.23 GiB already allocated; 26.94 MiB free; 1.27 GiB reserved in total by PyTorch)
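
The figures in the final error line can be checked with a little arithmetic. The sketch below parses that message and computes how much of the card is neither free nor reserved by PyTorch's caching allocator. The function name `parse_oom` and the regular expressions are illustrative, written against this one message, and are not part of any PyTorch API.

```python
import re

# Final line of the traceback above, copied verbatim.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; "
    "1.95 GiB total capacity; 1.23 GiB already allocated; "
    "26.94 MiB free; 1.27 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes quoted in the OOM message, converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (\wiB)",
        "total": r"([\d.]+) (\wiB) total capacity",
        "allocated": r"([\d.]+) (\wiB) already allocated",
        "free": r"([\d.]+) (\wiB) free",
        "reserved": r"([\d.]+) (\wiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field not found: {name}")
        value, unit = match.groups()
        sizes[name] = float(value) * UNIT_TO_MIB[unit]
    return sizes


sizes = parse_oom(MESSAGE)
# Memory that is neither free nor held by PyTorch's caching allocator.
outside = sizes["total"] - sizes["reserved"] - sizes["free"]
print(f"requested {sizes['requested']:.0f} MiB, free {sizes['free']:.2f} MiB")
print(f"held outside PyTorch's allocator: {outside:.0f} MiB")
```

The request (44 MiB) is larger than what is free (26.94 MiB), so the error is consistent with the reported numbers. The remaining several hundred MiB are presumably taken by things such as the CUDA context, the desktop, or other processes; the message alone does not say which.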

1 answer:

Answer 0 (score: 0)

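I finally found it. The problem was that I was using the new CUDA 11.2, which was the culprit. I removed it and installed CUDA 10.2, and that solved the problem.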
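
When comparing setups like this, it may help to record which CUDA version the installed PyTorch build targets and what the device reports. The following is a minimal sketch; the function name `cuda_report` is made up for illustration, and it only reads values PyTorch exposes. It does not show whether the system toolkit, the driver, or the PyTorch build was the part that mattered here.

```python
def cuda_report():
    """Collect CUDA-related facts from PyTorch, if it is installed."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the PyTorch binary was built against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        report["device_name"] = props.name
        report["total_mib"] = props.total_memory / 1024 ** 2
        # Memory currently held by tensors vs. held by the caching allocator.
        report["allocated_mib"] = torch.cuda.memory_allocated(0) / 1024 ** 2
        report["reserved_mib"] = torch.cuda.memory_reserved(0) / 1024 ** 2
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

Running this before and after changing the CUDA installation gives a concrete record of what actually changed from PyTorch's point of view.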