Question

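I am seeing very slow performance when running an inference batch loop on a single GPU.

The slowdown appears after the first batch has been processed, that is, once the GPU is nearly full and memory has to be reclaimed to accept the next batch.

With the GPU in a pristine state, performance is very fast (as expected).

I hope the code snippet and output below illustrate the problem concisely.

(For brevity, I removed the print statements and time measurements from the snippet.)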

predictions = None

for i, batch in enumerate(self.test_dataloader):

    # if this line is active - the bottleneck after the first batch moves here, rather than below
    # i.e. when i > 0
    # torch.cuda.empty_cache()    

    # HUGE PERFORMANCE HIT HAPPENS HERE - after the first batch
    # i.e. when i > 0
    # obviously tensor.to(device) uses torch.cuda.empty_cache() internally when needed
    # and it is inexplicably SLOW
    batch = tuple(t.to(device) for t in batch)  # to GPU (or CPU) when gpu

    b_input_ids, b_input_mask, b_labels = batch

    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

    logits = outputs[0]
    logits = logits.detach()

    # that doesn't help alleviate the issue
    del outputs

    predictions = logits if predictions is None else torch.cat((predictions, logits), 0)            

    # nor do all of the below - freeing references doesn't help speeding up
    del logits
    del b_input_ids
    del b_input_mask
    del b_labels
    for o in batch:
        del o
    del batch

Output

start empty cache... 0.00082
end empty cache... 1.9e-05
start to device... 3e-06
end to device... 0.001179 - HERE - time is super fast (as expected)
start outputs... 8e-06
end outputs... 0.334536
logits... 6e-06
start detach... 1.7e-05
end detach... 0.004036

start empty cache... 0.335932
end empty cache... 4e-06
start to device... 3e-06
end to device... 16.553849 - HERE - time is ridiculously high - it's 16 seconds to move tensor to GPU
start outputs... 2.3e-05
end outputs... 0.020878
logits... 7e-06
start detach... 1.4e-05
end detach... 0.00036

start empty cache... 0.00082
end empty cache... 6e-06
start to device... 4e-06
end to device... 17.385204 - HERE - time is ridiculously high
start outputs... 2.9e-05
end outputs... 0.021351
logits... 4e-06
start detach... 1.3e-05
end detach... 1.1e-05

...

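Am I missing something obvious, or is this expected GPU behavior?

I am posting this question before resorting to more complicated code that juggles the several GPUs and the CPU available on my server.

Thanks in advance, Albert

Edit

Solved. The problem was in the DataLoader constructor: I changed pin_memory to False (True was causing the problem). This sped up the .to(device) call by roughly 350%-400%.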

self.test_dataloader = DataLoader(
            test_dataset,
            sampler=SequentialSampler(test_dataset),
            # batch_size=len(test_dataset)  # AKA - single batch - nope! no mem for that
            batch_size=BATCH_SIZE_AKA_MAX_ROWS_PER_GUESS_TO_FIT_GPU_MEM,
            # tests
            num_workers=8,
            # maybe this is the culprit as suggested by user12750353 in stackoverflow
            # pin_memory=True
            pin_memory=False
        )
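
To check whether pin_memory has the same effect on another machine, the per-batch copy time can be measured under both settings. The following is only a minimal sketch: the helper name time_transfers, the dummy dataset, and the sizes are assumptions for illustration and are not part of the original code. It falls back to the CPU when no GPU is present.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_transfers(pin_memory, device, n_rows=2048, batch_size=256):
    """Return the per-batch host-to-device copy times (seconds) for one pass."""
    # Dummy data standing in for the real test dataset (an assumption).
    dataset = TensorDataset(torch.zeros(n_rows, 128, dtype=torch.long))
    loader = DataLoader(dataset, batch_size=batch_size, pin_memory=pin_memory)
    times = []
    for batch in loader:
        start = time.perf_counter()
        batch = tuple(t.to(device) for t in batch)
        if device.type == "cuda":
            # Wait for the copy to finish before stopping the clock.
            torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    return times


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for pin in (False, True):
    # Pinned memory is only relevant when a CUDA device is present.
    if pin and device.type != "cuda":
        continue
    times = time_transfers(pin, device)
    mean = sum(times) / len(times)
    print(f"pin_memory={pin}: mean copy time {mean:.6f} s over {len(times)} batches")
```

The numbers depend heavily on the hardware, the driver, and the num_workers setting, so this only shows how to compare the two settings, not which one will be faster.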

Answer 1

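If you correctly clear the references to previously allocated variables, you should not need to clear the cache. Cached memory behaves like free memory: it is memory that your script can reuse for new variables.

Also note that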

a = torch.zeros(10**9, dtype=torch.float)
a = torch.zeros(10**9, dtype=torch.float)

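requires 8 GB of memory at its peak, even though a only uses 4 GB (1B elements at 4 bytes each). This happens because torch.zeros allocates the new memory before the previous contents of a are released. The same thing may be happening on a larger scale in your model, depending on how it is implemented.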
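
One way to avoid the doubled peak is to drop the old reference before allocating the replacement. The sketch below uses a much smaller tensor than the example above (10**6 elements, an assumption made so it runs anywhere) and only reports a peak figure when a CUDA device is available.

```python
import torch

n = 10**6  # small stand-in for the 10**9 elements in the example above

a = torch.zeros(n, dtype=torch.float)
bytes_per_copy = a.numel() * a.element_size()  # float32 is 4 bytes per element

# Dropping the old reference first lets the allocator reuse that memory,
# so only one copy has to be alive at any time.
del a
a = torch.zeros(n, dtype=torch.float)
print(f"one copy: {bytes_per_copy / 1e6:.1f} MB")

if torch.cuda.is_available():
    # On a GPU, the peak can be inspected with the allocator statistics.
    torch.cuda.reset_peak_memory_stats()
    b = torch.zeros(n, dtype=torch.float, device="cuda")
    del b
    b = torch.zeros(n, dtype=torch.float, device="cuda")
    print(f"peak allocated: {torch.cuda.max_memory_allocated()} bytes")
```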

Edit 1

One suspicious thing is that you are loading one example at a time onto the GPU.

Just to illustrate what I mean:

import torch
device = 'cuda'
batch = torch.zeros((4500, 10))

Creating the batch as a tuple

batch_gpu = tuple(t.to(device) for t in batch) 
torch.cuda.synchronize()

254 ms ± 36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Creating the batch as a list

batch_gpu = list(t.to(device) for t in batch) 
torch.cuda.synchronize()

235 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Copying the whole batch in a single call

batch_gpu = batch.to(device)
torch.cuda.synchronize()

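115 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In this example, copying the whole batch in one call is about 2000 times faster than copying one example at a time.

Note that the GPU works asynchronously with respect to the CPU, so a call can return before the operation it launched has finished. To get meaningful measurements, call torch.cuda.synchronize() to make the timing boundaries explicit.

The code to instrument is this: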

for i, batch in enumerate(self.test_dataloader):

    # torch.cuda.empty_cache()    
    # torch.cuda.synchronize() # if empty_cache is used
    

    # start timer for copy
    batch = tuple(t.to(device) for t in batch)  # to GPU (or CPU) when gpu
    torch.cuda.synchronize()
    # stop timer for copy

    b_input_ids, b_input_mask, b_labels = batch

    # start timer for inference
    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    torch.cuda.synchronize()
    # stop timer for inference


    logits = outputs[0]
    logits = logits.detach()
    # if you copy outputs to CPU it will be synchronized
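
The "start timer" and "stop timer" placeholders in the loop above could be filled in with a small context manager that synchronizes on both sides of the timed region. This is a sketch only: cuda_timer and the dummy tensors are hypothetical names and data introduced for illustration, and a simple addition stands in for the model call.

```python
import time
from contextlib import contextmanager

import torch


@contextmanager
def cuda_timer(label, results, device):
    """Store the wall-clock time of the enclosed block in results[label]."""
    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure earlier GPU work is not counted
    start = time.perf_counter()
    try:
        yield
    finally:
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for the work launched inside the block
        results[label] = time.perf_counter() - start


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
timings = {}

# Dummy batch standing in for (b_input_ids, b_input_mask).
batch = (
    torch.zeros(64, 128, dtype=torch.long),
    torch.ones(64, 128, dtype=torch.long),
)

with cuda_timer("copy", timings, device):
    batch = tuple(t.to(device) for t in batch)

with cuda_timer("compute", timings, device):
    total = (batch[0] + batch[1]).sum()  # placeholder for the model forward pass

print(timings)
```

Without the synchronize calls, time spent in one step can be attributed to whichever later call happens to block, which is one way a copy can look slow when the real cost lies elsewhere.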
