The training process is odd. Loading the data takes a long time at one particular step, then becomes very fast for the next several steps. This pattern repeats over and over, with a period that matches the num_workers parameter. Is this normal? Or is there anything in PyTorch similar to TensorFlow's prefetch()?

Environment: PyTorch on Ubuntu 16.04; the problem occurs with both 1 GPU and 2 GPUs.
from torch.utils.data import DataLoader

_NUM_WORKERS = 8

# trainset and config.batch_size are defined elsewhere in my script
train_loader = DataLoader(
    trainset,
    batch_size=config.batch_size,
    shuffle=True,
    num_workers=_NUM_WORKERS,
    pin_memory=True)
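As an aside, I know that newer PyTorch versions (1.7 and later, newer than my setup) add two DataLoader arguments that sound related: prefetch_factor (how many batches each worker keeps in flight) and persistent_workers (workers survive across epochs). A minimal sketch of that variant, with prefetch_factor=4 as a purely illustrative value; all the timings below were measured with the plain loader above:

from torch.utils.data import DataLoader

# Only valid on PyTorch >= 1.7, where these two arguments exist.
train_loader = DataLoader(
    trainset,
    batch_size=config.batch_size,
    shuffle=True,
    num_workers=_NUM_WORKERS,
    pin_memory=True,
    prefetch_factor=4,        # batches kept in flight per worker; default is 2
    persistent_workers=True)  # reuse workers across epochs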
With _NUM_WORKERS = 8:
% Time format: time for this step (average time per step)
Epoch: [0][1/4642] Time 8.363 (8.363) <--- look here
Epoch: [0][2/4642] Time 0.557 (4.460)
Epoch: [0][3/4642] Time 0.564 (3.161)
Epoch: [0][4/4642] Time 0.562 (2.512)
Epoch: [0][5/4642] Time 0.560 (2.121)
Epoch: [0][6/4642] Time 0.569 (1.863)
Epoch: [0][7/4642] Time 0.565 (1.677)
Epoch: [0][8/4642] Time 0.573 (1.539)
Epoch: [0][9/4642] Time 3.031 (1.705) <--- look here
Epoch: [0][10/4642] Time 0.569 (1.591)
Epoch: [0][11/4642] Time 0.574 (1.499)
Epoch: [0][12/4642] Time 0.565 (1.421)
Epoch: [0][13/4642] Time 0.562 (1.355)
Epoch: [0][14/4642] Time 0.570 (1.299)
Epoch: [0][15/4642] Time 0.566 (1.250)
Epoch: [0][16/4642] Time 0.560 (1.207)
Epoch: [0][17/4642] Time 2.543 (1.286) <--- look here
With _NUM_WORKERS = 4:
% Time format: time for this step (average time per step)
Epoch: [0][1/4642] Time 5.997 (5.997) <--- look here
Epoch: [0][2/4642] Time 0.554 (3.275)
Epoch: [0][3/4642] Time 0.554 (2.368)
Epoch: [0][4/4642] Time 0.569 (1.918)
Epoch: [0][5/4642] Time 3.803 (2.295) <--- look here
Epoch: [0][6/4642] Time 0.566 (2.007)
Epoch: [0][7/4642] Time 0.561 (1.801)
Epoch: [0][8/4642] Time 0.565 (1.646)
Epoch: [0][9/4642] Time 5.011 (2.020) <--- look here
I think the ideal behavior would be for every step to take roughly the same amount of time. Is that achievable in PyTorch?
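For comparison, the kind of prefetching I have in mind is something like the CUDA-stream prefetcher pattern from NVIDIA's Apex examples. Below is a minimal sketch under my own assumptions: a single GPU, batches arriving as (input, target) tensor pairs from the pin_memory=True loader above, and the DataPrefetcher name being mine:

import torch

class DataPrefetcher:
    # Stages the next batch's host-to-device copies on a side CUDA
    # stream so they overlap with compute on the default stream.
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.next_input = None
        self.next_target = None
        self.preload()

    def preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = None
            self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking=True requires pin_memory=True in the DataLoader
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        # Make the default stream wait for the staged copies, then
        # start copying the batch after this one.
        torch.cuda.current_stream().wait_stream(self.stream)
        batch_input, batch_target = self.next_input, self.next_target
        if batch_input is not None:
            # Tell the caching allocator these tensors are still in use
            # on the default stream before the side stream moves on.
            batch_input.record_stream(torch.cuda.current_stream())
            batch_target.record_stream(torch.cuda.current_stream())
        self.preload()
        return batch_input, batch_target

# Usage: replace direct iteration over train_loader with:
prefetcher = DataPrefetcher(train_loader)
batch_input, batch_target = prefetcher.next()
while batch_input is not None:
    # ... forward / backward / optimizer step ...
    batch_input, batch_target = prefetcher.next()

This only hides the CPU-to-GPU copy, though; if the periodic stalls come from the workers themselves (e.g., slow disk reads or decoding), it would not remove the spikes.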