Error at epoch 14 when training ResNet50 on ImageNet

Time: 2021-01-11 14:21:39

Tags: python pytorch imagenet pytorch-dataloader

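I am training ResNet50 on ImageNet using the script provided by PyTorch (slightly modified for my purposes). However, after 14 epochs of training I get the error below. The job has 4 GPUs allocated on the server I run it on. Any pointers as to what this error means would be much appreciated. Thanks a lot!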

Epoch: [14][5000/5005]  Time 1.910 (2.018)  Data 0.000 (0.191)  Loss 2.6954 (2.7783)    Total 2.6954 (2.7783)   Reg 0.0000  Prec@1 42.969 (40.556)  Prec@5 64.844 (65.368)   
Test: [0/196]   Time 86.722 (86.722)    Loss 1.9551 (1.9551)    Prec@1 51.562 (51.562)  Prec@5 81.641 (81.641)
Traceback (most recent call last):
  File "main_group.py", line 549, in <module>
  File "main_group.py", line 256, in main
    
  File "main_group.py", line 466, in validate
    if args.gpu is not None:
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
    return self._process_data(data)
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 11.
Original Traceback (most recent call last):
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 138, in __getitem__
    sample = self.loader(path)
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 174, in default_loader
    return pil_loader(path)
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 155, in pil_loader
    with open(path, 'rb') as f:
OSError: [Errno 5] Input/output error: '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'

1 answer:

Answer 0 (score: 1)

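It is hard to tell what the problem is just from the error you posted.

All we know is that something went wrong while reading the file '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'. The traceback shows the OSError being raised by open(path, 'rb') inside torchvision's pil_loader, that is, before the image is decoded.

Try the following: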

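  1. Confirm that the file actually exists.
  2. Confirm that it is a valid JPEG and is not corrupted (for example, by opening it in an image viewer).
  3. Confirm that you can open it from Python, and also that you can load it manually with PIL.
  4. If none of this works, try deleting the file. Do you then get the same error on another file in the folder?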
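
The checks above can be scripted. Below is a minimal sketch, assuming Pillow is installed (torchvision already depends on it). The function names check_image and scan_folder are made up for this example and are not part of PyTorch or torchvision. It tests existence, then a raw read with open(), then a PIL integrity check and a full decode to RGB, and it can walk a folder to see whether other files fail too.

```python
import os

from PIL import Image


def check_image(path):
    """Return (ok, message) for one image file. Hypothetical helper."""
    # Step 1: does the file exist at all?
    if not os.path.isfile(path):
        return False, "file does not exist"

    # Step 3a: can plain Python read the bytes? An OS-level I/O error
    # such as Errno 5 would surface here, before PIL is involved.
    try:
        with open(path, "rb") as f:
            f.read()
    except OSError as e:
        return False, f"cannot read file: {e}"

    # Steps 2 and 3b: is it a valid image that PIL can load?
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check without decoding pixel data
        # verify() requires reopening the file before further use.
        with Image.open(path) as img:
            img.convert("RGB")  # full decode to RGB
    except Exception as e:
        return False, f"not a valid image: {e}"

    return True, "ok"


def scan_folder(root):
    """Step 4: return [(path, message)] for every failing file under root."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            ok, message = check_image(path)
            if not ok:
                failures.append((path, message))
    return failures
```

For example, call check_image on the path from the traceback, then scan_folder on the val/n02102973 directory. If the raw read already fails with Errno 5, the file contents are probably not the issue: that error code is a low-level I/O error, which usually points at the disk or the (network) filesystem rather than at the JPEG itself.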