Say we have a function like this:
def trn_l(self, totall_lc, totall_lw, totall_li, totall_lr):
    # Move the large model to the GPU and switch it to training mode.
    self.model_large.cuda()
    self.model_large.train()
    self.optimizer_large.zero_grad()

    # Accumulate gradients over `fake_batch` validation mini-batches.
    for fb in range(self.fake_batch):
        val_x, val_y = next(self.valid_loader)
        val_x, val_y = val_x.cuda(), val_y.cuda()

        logits_main, emsemble_logits_main = self.model_large(val_x)
        cel = self.criterion(logits_main, val_y)
        loss_weight = cel / (self.fake_batch)
        loss_weight.backward(retain_graph=False)

        # Detach the intermediates and move them back to the CPU.
        cel = cel.cpu().detach()
        emsemble_logits_main = emsemble_logits_main.cpu().detach()
        totall_lw += float(loss_weight.item())
        val_x = val_x.cpu().detach()
        val_y = val_y.cpu().detach()
        loss_weight = loss_weight.cpu().detach()

    # Clip gradients, step the optimizer, then move the model off the GPU.
    self._clip_grad_norm(self.model_large)
    self.optimizer_large.step()
    self.model_large.train(mode=False)
    self.model_large = self.model_large.cpu()

    return totall_lc, totall_lw, totall_li, totall_lr
On the first call it allocates 8 GB of GPU memory. On the next call, no new memory gets allocated, yet 8 GB stays occupied. I would like the allocated GPU memory to be 0, or as low as possible, after the call returns and the first results have been produced.
What I have tried: passing retain_graph=False and calling .cpu().detach() everywhere - no positive effect.
Memory snapshot before the call:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 33100 KB | 33219 KB | 40555 KB | 7455 KB |
| from large pool | 3072 KB | 3072 KB | 3072 KB | 0 KB |
| from small pool | 30028 KB | 30147 KB | 37483 KB | 7455 KB |
|---------------------------------------------------------------------------|
| Active memory | 33100 KB | 33219 KB | 40555 KB | 7455 KB |
| from large pool | 3072 KB | 3072 KB | 3072 KB | 0 KB |
| from small pool | 30028 KB | 30147 KB | 37483 KB | 7455 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 51200 KB | 51200 KB | 51200 KB | 0 B |
| from large pool | 20480 KB | 20480 KB | 20480 KB | 0 B |
| from small pool | 30720 KB | 30720 KB | 30720 KB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 18100 KB | 20926 KB | 56892 KB | 38792 KB |
| from large pool | 17408 KB | 18944 KB | 18944 KB | 1536 KB |
| from small pool | 692 KB | 2047 KB | 37948 KB | 37256 KB |
|---------------------------------------------------------------------------|
| Allocations | 12281 | 12414 | 12912 | 631 |
| from large pool | 2 | 2 | 2 | 0 |
| from small pool | 12279 | 12412 | 12910 | 631 |
|---------------------------------------------------------------------------|
| Active allocs | 12281 | 12414 | 12912 | 631 |
| from large pool | 2 | 2 | 2 | 0 |
| from small pool | 12279 | 12412 | 12910 | 631 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 16 | 16 | 16 | 0 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 15 | 15 | 15 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 3 | 30 | 262 | 259 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 2 | 29 | 261 | 259 |
|===========================================================================|
And after calling the function and then running
torch.cuda.empty_cache()
torch.cuda.synchronize()
we get:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 10957 KB | 8626 MB | 272815 MB | 272804 MB |
| from large pool | 0 KB | 8596 MB | 272477 MB | 272477 MB |
| from small pool | 10957 KB | 33 MB | 337 MB | 327 MB |
|---------------------------------------------------------------------------|
| Active memory | 10957 KB | 8626 MB | 272815 MB | 272804 MB |
| from large pool | 0 KB | 8596 MB | 272477 MB | 272477 MB |
| from small pool | 10957 KB | 33 MB | 337 MB | 327 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 8818 MB | 9906 MB | 19618 MB | 10800 MB |
| from large pool | 8784 MB | 9874 MB | 19584 MB | 10800 MB |
| from small pool | 34 MB | 34 MB | 34 MB | 0 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 5427 KB | 3850 MB | 207855 MB | 207850 MB |
| from large pool | 0 KB | 3850 MB | 207494 MB | 207494 MB |
| from small pool | 5427 KB | 5 MB | 360 MB | 355 MB |
|---------------------------------------------------------------------------|
| Allocations | 3853 | 13391 | 34339 | 30486 |
| from large pool | 0 | 557 | 12392 | 12392 |
| from small pool | 3853 | 12838 | 21947 | 18094 |
|---------------------------------------------------------------------------|
| Active allocs | 3853 | 13391 | 34339 | 30486 |
| from large pool | 0 | 557 | 12392 | 12392 |
| from small pool | 3853 | 12838 | 21947 | 18094 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 226 | 226 | 410 | 184 |
| from large pool | 209 | 209 | 393 | 184 |
| from small pool | 17 | 17 | 17 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 46 | 358 | 12284 | 12238 |
| from large pool | 0 | 212 | 7845 | 7845 |
| from small pool | 46 | 279 | 4439 | 4393 |
|===========================================================================|
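For reference, summaries like the two above can be printed with torch.cuda.memory_summary(). A minimal sketch of the before/after measurement, assuming the training function lives on some trainer object (the object name and the zero arguments are placeholders, not part of the code above):

import torch

print(torch.cuda.memory_summary(device=0))    # snapshot before the call

# Hypothetical call site; `trainer` stands in for whatever object owns trn_l:
# totall_lc, totall_lw, totall_li, totall_lr = trainer.trn_l(0.0, 0.0, 0.0, 0.0)

torch.cuda.empty_cache()                      # return unused cached blocks to the driver
torch.cuda.synchronize()                      # wait for pending GPU work and frees
print(torch.cuda.memory_summary(device=0))    # snapshot after the call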
Answer 0 (score: 2)
I think the other answer is correct. Allocation and deallocation definitely do happen at runtime; the thing to note is that the CPU code runs asynchronously from the GPU code, so if you want to reserve more memory afterwards, you need to wait for any deallocation to actually happen. Take a look at this:
import torch

a = torch.zeros(100, 100, 100).cuda()  # allocate a tensor on the GPU
print(torch.cuda.memory_allocated())   # ~4 MB are now live on the GPU

del a                                  # drop the only reference to the tensor
torch.cuda.synchronize()               # wait for the asynchronous deallocation
print(torch.cuda.memory_allocated())   # now reports 0
Output:
4000256
0
So you should del the tensors you no longer need and call torch.cuda.synchronize() to make sure the deallocation has gone through before your CPU code continues.
In your particular case, after the function trn_l returns, all variables local to it that are not referenced anywhere else will be released together with their corresponding GPU tensors. All you need to do is wait for this to happen by calling torch.cuda.synchronize() after the function call.
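A minimal, self-contained illustration of that point, where work() is a stand-in for trn_l (a hypothetical function, not from the question's code) and keeps all of its GPU tensors local:

import torch

def work():
    # All GPU tensors are local; no reference escapes the function.
    x = torch.zeros(256, 256, 256, device="cuda")
    y = (x + 1.0).sum()
    return float(y.item())

result = work()
torch.cuda.synchronize()                # wait for the asynchronous deallocations
print(torch.cuda.memory_allocated())    # back to (near) zero: the locals were freed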
Answer 1 (score: 0)
So, PyTorch does not actually allocate and release GPU memory from the driver during training; it manages it through a caching allocator.
From https://pytorch.org/docs/stable/notes/faq.html#my-gpu-memory-isn-t-freed-properly:
PyTorch uses a caching memory allocator to speed up memory allocations. As a result, the values shown in nvidia-smi usually don't reflect the true memory usage. See Memory management for more details about GPU memory management.
If your GPU memory isn't freed even after Python quits, it is very likely that some Python subprocesses are still alive. You may find them via ps -elf | grep python and manually kill them with kill -9 [pid].
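The caching behavior can be seen directly by comparing how much memory tensors are actually using with how much PyTorch keeps reserved in its cache (a small sketch; the tensor size is arbitrary, and memory_reserved() is the name used in current PyTorch releases):

import torch

a = torch.zeros(1024, 1024, 256, device="cuda")   # ~1 GiB of float32
del a
torch.cuda.synchronize()

# The tensor is gone, but the caching allocator keeps the block reserved;
# that reserved amount is what nvidia-smi reports as used memory.
print(torch.cuda.memory_allocated())   # ~0: no tensor is using the memory
print(torch.cuda.memory_reserved())    # still ~1 GiB held in PyTorch's cache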
You can call torch.cuda.empty_cache() to free all unused cached memory (this is not good practice as a routine step, though, because re-allocating that memory later is time-consuming). Docs for empty_cache(): https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache
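As a small illustration of what empty_cache() does (a sketch; the tensor size is arbitrary):

import torch

b = torch.zeros(1024, 1024, 256, device="cuda")   # ~1 GiB of float32
del b
torch.cuda.synchronize()
print(torch.cuda.memory_reserved())    # the block is still cached by the allocator

torch.cuda.empty_cache()               # hand unused cached blocks back to the driver
print(torch.cuda.memory_reserved())    # drops back toward zero; nvidia-smi follows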