OverflowError: (34, 'Numerical result out of range') in PyTorch

Time: 2017-05-28 18:48:17

Tags: pytorch

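When I run my code on a different GPU (a Tesla K-20 with CUDA 7.5 installed and 6 GB of memory), I get the following error (see the stack trace). The same code works fine on a GeForce 1080 or a Titan X GPU.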

Stack trace

  File "code/source/main.py", line 68, in <module>
    train.train_epochs(train_batches, dev_batches, args.epochs)
  File "/gpfs/home/g/e/geniiexe/BigRed2/code/source/train.py", line 34, in train_epochs
    losses = self.train(train_batches, dev_batches, (epoch + 1))
  File "/gpfs/home/g/e/geniiexe/BigRed2/code/source/train.py", line 76, in train
    self.optimizer.step()
  File "/gpfs/home/g/e/geniiexe/BigRed2/anaconda3/lib/python3.5/site-packages/torch/optim/adam.py", line 70, in step
    bias_correction1 = 1 - beta1 ** state['step']
OverflowError: (34, 'Numerical result out of range')

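So why does this error occur on one GPU (the Tesla K-20) when the same code runs fine on a GeForce or Titan X GPU? And what does the error mean? I don't think it is related to running out of memory.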

2 Answers:

Answer 0: (score: 0)

One workaround suggested on discuss.pytorch.org is as follows:

Replace the following lines in adam.py:

bias_correction1 = 1 - beta1 ** state['step']
bias_correction2 = 1 - beta2 ** state['step']

with

bias_correction1 = 1 - beta1 ** min(state['step'], 1022)
bias_correction2 = 1 - beta2 ** min(state['step'], 1022)
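
As a hedged aside on what the error means: errno 34 is ERANGE, and CPython raises OverflowError when a float power reports that range error. One plausible explanation (an assumption, not confirmed in this thread) is that the C math library on the failing machine flags the very small result of beta1 ** step, which would make the error depend on the host system rather than on the GPU. The sketch below shows an alternative that leaves the exponent unclamped and treats a failed power as 0.0. The names `safe_pow` and `bias_corrections` are hypothetical helpers for illustration, not part of PyTorch.

```python
import errno


def safe_pow(base, exponent):
    """Return base ** exponent, mapping a range error to 0.0.

    Assumption: 0 < base < 1 and exponent >= 0, so an OverflowError
    can only come from a result that is too small, i.e. effectively 0.
    """
    if not (0.0 < base < 1.0) or exponent < 0:
        raise ValueError("safe_pow expects 0 < base < 1 and exponent >= 0")
    try:
        return base ** exponent
    except OverflowError:
        return 0.0


def bias_corrections(beta1, beta2, step):
    """Adam bias-correction terms, computed with safe_pow (hypothetical helper)."""
    return 1 - safe_pow(beta1, step), 1 - safe_pow(beta2, step)


# Look up the symbolic name of errno 34 on the current platform.
print(errno.errorcode.get(34))
print(bias_corrections(0.9, 0.999, 10))
print(bias_corrections(0.9, 0.999, 10 ** 6))
```

On a machine where the power never raises, `safe_pow` returns exactly what `**` returns, so this variant only changes behavior where the error would otherwise occur.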

Answer 1: (score: 0)

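If anyone ends up here as I did, looking for the same error but on the CPU with scikit-learn's MLPClassifier, the fix above happens to be a good enough hint for patching the sklearn code.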

The workaround is: in the file .../site-packages/sklearn/neural_network/_stochastic_optimizers.py

change this:

self.learning_rate = (self.learning_rate_init *
                      np.sqrt(1 - self.beta_2 ** self.t) /
                      (1 - self.beta_1 ** self.t))

to this:

orig_self_t = self.t
new_self_t = min(orig_self_t, 1022)
self.learning_rate = (self.learning_rate_init *
                      np.sqrt(1 - self.beta_2 ** new_self_t) /
                      (1 - self.beta_1 ** new_self_t))
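
One thing worth checking before adopting the clamp: with beta values of 0.9 and 0.999 (assumed here as the common Adam defaults; the thread does not say which values were used), 0.999 ** 1022 is still about 0.36, so the clamped correction stops short of its limit. The sketch below uses only the standard library to compare the clamped and unclamped step-size factors. `lr_factor` is a hypothetical helper, not sklearn code.

```python
import math


def lr_factor(t, beta_1=0.9, beta_2=0.999, max_t=None):
    """Return sqrt(1 - beta_2**t) / (1 - beta_1**t), optionally clamping t.

    The default beta values are assumptions for illustration.
    """
    if max_t is not None:
        t = min(t, max_t)
    return math.sqrt(1 - beta_2 ** t) / (1 - beta_1 ** t)


# Compare the factor with and without the 1022 clamp at a few step counts.
for t in (100, 1022, 5000):
    unclamped = lr_factor(t)
    clamped = lr_factor(t, max_t=1022)
    print(t, round(unclamped, 4), round(clamped, 4))
```

Under these assumed values the clamped factor settles near 0.80 while the unclamped factor approaches 1, so the clamp lowers the long-run learning rate by roughly 20%. A larger cap would shrink that gap, provided it stays below the step count at which the error appears.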