The training process is odd. Loading the data takes a long time at one particular step, then becomes very fast for the next several steps. This pattern repeats over and over, with a period that matches the num_workers parameter. Is this normal? Or is there anything in PyTorch similar to TensorFlow's prefetch()?

Environment: PyTorch on Ubuntu 16.04; the problem occurs with both 1 GPU and 2 GPUs.
from torch.utils.data import DataLoader

_NUM_WORKERS = 8

# trainset and config.batch_size are defined elsewhere in my script
train_loader = DataLoader(
    trainset,
    batch_size=config.batch_size,
    shuffle=True,
    num_workers=_NUM_WORKERS,
    pin_memory=True)
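As an aside, I know that newer PyTorch versions (1.7 and later, newer than my setup) add two DataLoader arguments that sound related: prefetch_factor (how many batches each worker keeps in flight) and persistent_workers (workers survive across epochs). A minimal sketch of that variant, with prefetch_factor=4 as a purely illustrative value; all the timings below were measured with the plain loader above:

from torch.utils.data import DataLoader

# Only valid on PyTorch >= 1.7, where these two arguments exist.
train_loader = DataLoader(
    trainset,
    batch_size=config.batch_size,
    shuffle=True,
    num_workers=_NUM_WORKERS,
    pin_memory=True,
    prefetch_factor=4,        # batches kept in flight per worker; default is 2
    persistent_workers=True)  # reuse workers across epochs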
With _NUM_WORKERS = 8:
% Time format: time for this step (average time per step)
Epoch: [0][1/4642] Time 8.363 (8.363) <--- look here
Epoch: [0][2/4642] Time 0.557 (4.460)
Epoch: [0][3/4642] Time 0.564 (3.161)
Epoch: [0][4/4642] Time 0.562 (2.512)
Epoch: [0][5/4642] Time 0.560 (2.121)
Epoch: [0][6/4642] Time 0.569 (1.863)
Epoch: [0][7/4642] Time 0.565 (1.677)
Epoch: [0][8/4642] Time 0.573 (1.539)
Epoch: [0][9/4642] Time 3.031 (1.705) <--- look here
Epoch: [0][10/4642] Time 0.569 (1.591)
Epoch: [0][11/4642] Time 0.574 (1.499)
Epoch: [0][12/4642] Time 0.565 (1.421)
Epoch: [0][13/4642] Time 0.562 (1.355)
Epoch: [0][14/4642] Time 0.570 (1.299)
Epoch: [0][15/4642] Time 0.566 (1.250)
Epoch: [0][16/4642] Time 0.560 (1.207)
Epoch: [0][17/4642] Time 2.543 (1.286) <--- look here
With _NUM_WORKERS = 4:
% Time format: time for this step (average time per step)
Epoch: [0][1/4642] Time 5.997 (5.997) <--- look here
Epoch: [0][2/4642] Time 0.554 (3.275)
Epoch: [0][3/4642] Time 0.554 (2.368)
Epoch: [0][4/4642] Time 0.569 (1.918)
Epoch: [0][5/4642] Time 3.803 (2.295) <--- look here
Epoch: [0][6/4642] Time 0.566 (2.007)
Epoch: [0][7/4642] Time 0.561 (1.801)
Epoch: [0][8/4642] Time 0.565 (1.646)
Epoch: [0][9/4642] Time 5.011 (2.020) <--- look here
I think the ideal behavior would be for every step to take roughly the same amount of time. Is that achievable in PyTorch?
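For comparison, the kind of prefetching I have in mind is something like the CUDA-stream prefetcher pattern from NVIDIA's Apex examples. Below is a minimal sketch under my own assumptions: a single GPU, batches arriving as (input, target) tensor pairs from the pin_memory=True loader above, and the DataPrefetcher name being mine:

import torch

class DataPrefetcher:
    # Stages the next batch's host-to-device copies on a side CUDA
    # stream so they overlap with compute on the default stream.
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.next_input = None
        self.next_target = None
        self.preload()

    def preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = None
            self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking=True requires pin_memory=True in the DataLoader
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        # Make the default stream wait for the staged copies, then
        # start copying the batch after this one.
        torch.cuda.current_stream().wait_stream(self.stream)
        batch_input, batch_target = self.next_input, self.next_target
        if batch_input is not None:
            # Tell the caching allocator these tensors are still in use
            # on the default stream before the side stream moves on.
            batch_input.record_stream(torch.cuda.current_stream())
            batch_target.record_stream(torch.cuda.current_stream())
        self.preload()
        return batch_input, batch_target

# Usage: replace direct iteration over train_loader with:
prefetcher = DataPrefetcher(train_loader)
batch_input, batch_target = prefetcher.next()
while batch_input is not None:
    # ... forward / backward / optimizer step ...
    batch_input, batch_target = prefetcher.next()

This only hides the CPU-to-GPU copy, though; if the periodic stalls come from the workers themselves (e.g., slow disk reads or decoding), it would not remove the spikes.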