Say we have a function like this:
def trn_l(self, totall_lc, totall_lw, totall_li, totall_lr):
    # Move the large model to the GPU and switch it to training mode.
    self.model_large.cuda()
    self.model_large.train()
    self.optimizer_large.zero_grad()

    # Accumulate gradients over `fake_batch` validation mini-batches.
    for fb in range(self.fake_batch):
        val_x, val_y = next(self.valid_loader)
        val_x, val_y = val_x.cuda(), val_y.cuda()

        logits_main, emsemble_logits_main = self.model_large(val_x)
        cel = self.criterion(logits_main, val_y)
        loss_weight = cel / (self.fake_batch)
        loss_weight.backward(retain_graph=False)

        # Detach the intermediates and move them back to the CPU.
        cel = cel.cpu().detach()
        emsemble_logits_main = emsemble_logits_main.cpu().detach()
        totall_lw += float(loss_weight.item())
        val_x = val_x.cpu().detach()
        val_y = val_y.cpu().detach()
        loss_weight = loss_weight.cpu().detach()

    # Clip gradients, step the optimizer, then move the model off the GPU.
    self._clip_grad_norm(self.model_large)
    self.optimizer_large.step()
    self.model_large.train(mode=False)
    self.model_large = self.model_large.cpu()

    return totall_lc, totall_lw, totall_li, totall_lr
On the first call it allocates 8 GB of GPU memory. On the next call, no new memory gets allocated, yet 8 GB stays occupied. I would like the allocated GPU memory to be 0, or as low as possible, after the call returns and the first results have been produced.
What I have tried: passing retain_graph=False and calling .cpu().detach() everywhere - no positive effect.
Memory snapshot before the call:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 33100 KB | 33219 KB | 40555 KB | 7455 KB |
| from large pool | 3072 KB | 3072 KB | 3072 KB | 0 KB |
| from small pool | 30028 KB | 30147 KB | 37483 KB | 7455 KB |
|---------------------------------------------------------------------------|
| Active memory | 33100 KB | 33219 KB | 40555 KB | 7455 KB |
| from large pool | 3072 KB | 3072 KB | 3072 KB | 0 KB |
| from small pool | 30028 KB | 30147 KB | 37483 KB | 7455 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 51200 KB | 51200 KB | 51200 KB | 0 B |
| from large pool | 20480 KB | 20480 KB | 20480 KB | 0 B |
| from small pool | 30720 KB | 30720 KB | 30720 KB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 18100 KB | 20926 KB | 56892 KB | 38792 KB |
| from large pool | 17408 KB | 18944 KB | 18944 KB | 1536 KB |
| from small pool | 692 KB | 2047 KB | 37948 KB | 37256 KB |
|---------------------------------------------------------------------------|
| Allocations | 12281 | 12414 | 12912 | 631 |
| from large pool | 2 | 2 | 2 | 0 |
| from small pool | 12279 | 12412 | 12910 | 631 |
|---------------------------------------------------------------------------|
| Active allocs | 12281 | 12414 | 12912 | 631 |
| from large pool | 2 | 2 | 2 | 0 |
| from small pool | 12279 | 12412 | 12910 | 631 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 16 | 16 | 16 | 0 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 15 | 15 | 15 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 3 | 30 | 262 | 259 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 2 | 29 | 261 | 259 |
|===========================================================================|
And after calling the function and then running
torch.cuda.empty_cache()
torch.cuda.synchronize()
we get:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 10957 KB | 8626 MB | 272815 MB | 272804 MB |
| from large pool | 0 KB | 8596 MB | 272477 MB | 272477 MB |
| from small pool | 10957 KB | 33 MB | 337 MB | 327 MB |
|---------------------------------------------------------------------------|
| Active memory | 10957 KB | 8626 MB | 272815 MB | 272804 MB |
| from large pool | 0 KB | 8596 MB | 272477 MB | 272477 MB |
| from small pool | 10957 KB | 33 MB | 337 MB | 327 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 8818 MB | 9906 MB | 19618 MB | 10800 MB |
| from large pool | 8784 MB | 9874 MB | 19584 MB | 10800 MB |
| from small pool | 34 MB | 34 MB | 34 MB | 0 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 5427 KB | 3850 MB | 207855 MB | 207850 MB |
| from large pool | 0 KB | 3850 MB | 207494 MB | 207494 MB |
| from small pool | 5427 KB | 5 MB | 360 MB | 355 MB |
|---------------------------------------------------------------------------|
| Allocations | 3853 | 13391 | 34339 | 30486 |
| from large pool | 0 | 557 | 12392 | 12392 |
| from small pool | 3853 | 12838 | 21947 | 18094 |
|---------------------------------------------------------------------------|
| Active allocs | 3853 | 13391 | 34339 | 30486 |
| from large pool | 0 | 557 | 12392 | 12392 |
| from small pool | 3853 | 12838 | 21947 | 18094 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 226 | 226 | 410 | 184 |
| from large pool | 209 | 209 | 393 | 184 |
| from small pool | 17 | 17 | 17 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 46 | 358 | 12284 | 12238 |
| from large pool | 0 | 212 | 7845 | 7845 |
| from small pool | 46 | 279 | 4439 | 4393 |
|===========================================================================|
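For reference, summaries like the two above can be printed with torch.cuda.memory_summary(). A minimal sketch of the before/after measurement, assuming the training function lives on some trainer object (the object name and the zero arguments are placeholders, not part of the code above):

import torch

print(torch.cuda.memory_summary(device=0))    # snapshot before the call

# Hypothetical call site; `trainer` stands in for whatever object owns trn_l:
# totall_lc, totall_lw, totall_li, totall_lr = trainer.trn_l(0.0, 0.0, 0.0, 0.0)

torch.cuda.empty_cache()                      # return unused cached blocks to the driver
torch.cuda.synchronize()                      # wait for pending GPU work and frees
print(torch.cuda.memory_summary(device=0))    # snapshot after the call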
Answer 0 (score: 2)
I think the other answer is correct. Allocation and deallocation definitely do happen at runtime; the thing to note is that the CPU code runs asynchronously from the GPU code, so if you want to reserve more memory afterwards, you need to wait for any deallocation to actually happen. Take a look at this:
import torch

a = torch.zeros(100, 100, 100).cuda()  # allocate a tensor on the GPU
print(torch.cuda.memory_allocated())   # ~4 MB are now live on the GPU

del a                                  # drop the only reference to the tensor
torch.cuda.synchronize()               # wait for the asynchronous deallocation
print(torch.cuda.memory_allocated())   # now reports 0
Output:
4000256
0
So you should del the tensors you no longer need and call torch.cuda.synchronize() to make sure the deallocation has gone through before your CPU code continues.
In your particular case, after the function trn_l returns, all variables local to it that are not referenced anywhere else will be released together with their corresponding GPU tensors. All you need to do is wait for this to happen by calling torch.cuda.synchronize() after the function call.
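A minimal, self-contained illustration of that point, where work() is a stand-in for trn_l (a hypothetical function, not from the question's code) and keeps all of its GPU tensors local:

import torch

def work():
    # All GPU tensors are local; no reference escapes the function.
    x = torch.zeros(256, 256, 256, device="cuda")
    y = (x + 1.0).sum()
    return float(y.item())

result = work()
torch.cuda.synchronize()                # wait for the asynchronous deallocations
print(torch.cuda.memory_allocated())    # back to (near) zero: the locals were freed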
Answer 1 (score: 0)
So, PyTorch does not actually allocate and release GPU memory from the driver during training; it manages it through a caching allocator.
From https://pytorch.org/docs/stable/notes/faq.html#my-gpu-memory-isn-t-freed-properly:
PyTorch uses a caching memory allocator to speed up memory allocations. As a result, the values shown in nvidia-smi usually don't reflect the true memory usage. See Memory management for more details about GPU memory management.
If your GPU memory isn't freed even after Python quits, it is very likely that some Python subprocesses are still alive. You may find them via ps -elf | grep python and manually kill them with kill -9 [pid].
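The caching behavior can be seen directly by comparing how much memory tensors are actually using with how much PyTorch keeps reserved in its cache (a small sketch; the tensor size is arbitrary, and memory_reserved() is the name used in current PyTorch releases):

import torch

a = torch.zeros(1024, 1024, 256, device="cuda")   # ~1 GiB of float32
del a
torch.cuda.synchronize()

# The tensor is gone, but the caching allocator keeps the block reserved;
# that reserved amount is what nvidia-smi reports as used memory.
print(torch.cuda.memory_allocated())   # ~0: no tensor is using the memory
print(torch.cuda.memory_reserved())    # still ~1 GiB held in PyTorch's cache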
You can call torch.cuda.empty_cache() to free all unused cached memory (this is not good practice as a routine step, though, because re-allocating that memory later is time-consuming). Docs for empty_cache(): https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache
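As a small illustration of what empty_cache() does (a sketch; the tensor size is arbitrary):

import torch

b = torch.zeros(1024, 1024, 256, device="cuda")   # ~1 GiB of float32
del b
torch.cuda.synchronize()
print(torch.cuda.memory_reserved())    # the block is still cached by the allocator

torch.cuda.empty_cache()               # hand unused cached blocks back to the driver
print(torch.cuda.memory_reserved())    # drops back toward zero; nvidia-smi follows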