I am trying to train a graph attention neural network with DeepSpeed (Microsoft's training optimization library), but I cannot understand the error

Time: 2021-07-08 04:24:04

Tags: deep-learning pytorch google-colaboratory

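I am trying to train a neural network with Microsoft's DeepSpeed library (https://www.microsoft.com/en-us/research/project/deepspeed/). For context, I previously hit a CUDA version mismatch between PyTorch and Google Colab, so I had to downgrade Colab's CUDA version to 10.2. Now I get the error below. It looks to me like some kind of memory allocation error, but I have not been able to resolve it.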

Loading extension module utils...
Time to load utils op: 0.4091193675994873 seconds
[b8c4d204706d:21593] *** Process received signal ***
[b8c4d204706d:21593] Signal: Segmentation fault (11)
[b8c4d204706d:21593] Signal code: Address not mapped (1)
[b8c4d204706d:21593] Failing at address: 0x7f917f55620d
[b8c4d204706d:21593] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f9181df8980]
[b8c4d204706d:21593] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f9181a378a5]
[b8c4d204706d:21593] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f91822a2e44]
[b8c4d204706d:21593] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f9181a38735]
[b8c4d204706d:21593] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f91822a0cb3]
[b8c4d204706d:21593] *** End of error message ***
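
To make the trace above easier to read, here is a minimal sketch that parses the five backtrace frames into (frame index, library, symbol) tuples. It uses only the Python standard library and the frames exactly as they appear in the log. The names `parse_frames`, `FRAME_RE`, and `KNOWN_DEMANGLED` are made up for this illustration, and the single demangled name is written by hand rather than produced by a tool such as `c++filt`.

```python
import re
from pathlib import PurePosixPath

# Backtrace frames copied verbatim from the log above.
TRACE = """\
[b8c4d204706d:21593] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f9181df8980]
[b8c4d204706d:21593] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f9181a378a5]
[b8c4d204706d:21593] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f91822a2e44]
[b8c4d204706d:21593] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f9181a38735]
[b8c4d204706d:21593] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f91822a0cb3]
"""

# Matches "[ N] /path/to/lib.so(symbol+0xOFFSET)[0xADDRESS]"; the symbol may be empty.
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+(?P<lib>/\S+?)"
    r"\((?P<symbol>[^()+]*)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]"
)

# Hand-written demangling for the one C++ symbol in this trace (illustrative only).
KNOWN_DEMANGLED = {
    "_ZN13TCMallocGuardD1Ev": "TCMallocGuard::~TCMallocGuard()",
}


def parse_frames(trace):
    """Return a list of (frame_index, library_file_name, symbol) tuples."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # skip lines that are not backtrace frames
        symbol = match.group("symbol") or "?"  # "?" marks a frame with no symbol name
        symbol = KNOWN_DEMANGLED.get(symbol, symbol)
        library = PurePosixPath(match.group("lib")).name
        frames.append((int(match.group("index")), library, symbol))
    return frames


if __name__ == "__main__":
    for index, library, symbol in parse_frames(TRACE):
        print(f"#{index} {library}: {symbol}")
```

Read from the bottom frame up, the named symbols are `__cxa_finalize`, then the `TCMallocGuard` destructor in libtcmalloc, then `getenv` in libc. That ordering suggests the segmentation fault is raised while the process is running its exit-time cleanup inside tcmalloc, rather than inside a PyTorch or CUDA allocation call, although the trace alone does not prove the root cause.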

0 answers:

No answers