Question

I am trying to train a PyTorch FLAIR model in AWS SageMaker. Doing so produces the following error:

RuntimeError: CUDA out of memory. Tried to allocate 84.00 MiB (GPU 0; 11.17 GiB total capacity; 9.29 GiB already allocated; 7.31 MiB free; 10.80 GiB reserved in total by PyTorch)

For training I used the sagemaker.pytorch.estimator.PyTorch class.

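I tried different instance types, from ml.m5 and g4dn to p3 (even one with 96 GB of memory). On ml.m5 I get a CPU memory error, on g4dn a GPU memory error, and on p3 a GPU memory error as well, mostly because PyTorch uses only one 12 GB GPU out of the 8 * 12 GB available.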

I could not complete this training even locally on a CPU machine, where I got the following error:

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 67108864 bytes. Buy new RAM!

Model training script:

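    from flair.datasets import ClassificationCorpus
    from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
    from flair.models import TextClassifier
    from flair.trainers import ModelTrainer
    from torch.optim import Adam

    # data_folder is defined elsewhere in the script
    corpus = ClassificationCorpus(data_folder, test_file='../data/exports/val.csv', train_file='../data/exports/train.csv')

    print("finished loading corpus")

    # Stack GloVe word embeddings with forward and backward Flair embeddings
    word_embeddings = [WordEmbeddings('glove'), FlairEmbeddings('news-forward-fast'), FlairEmbeddings('news-backward-fast')]

    document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)

    classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)

    trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

    # embeddings_storage_mode="none": do not keep computed embeddings in memory
    trainer.train('../model_files', max_epochs=12, learning_rate=0.0001, train_with_dev=False, embeddings_storage_mode="none")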

P.S.: I was able to train the same architecture on a smaller dataset on my local GPU machine with a 4 GB GTX 1650 (DDR5 memory), and it was really fast.

Answer 1

This error occurs because your GPU ran out of memory. There are a few things you can try:

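Reduce the size of the training data
Reduce the size of the model, i.e. the number of hidden layers or the depth
You can also try reducing the batch size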
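
To see why the batch size matters, here is a rough back-of-the-envelope sketch of how the memory for one padded batch of token embeddings scales. The function name and all the numbers (512 tokens, an embedding width of 2048, float32 values) are hypothetical and chosen only for illustration; real usage is higher because of activations, gradients and optimizer state.

```python
def embedding_batch_bytes(batch_size, max_tokens, embedding_dim, bytes_per_value=4):
    """Rough size in bytes of one padded batch of token embeddings.

    Assumes every item is padded to max_tokens and each value is a
    float32 (4 bytes). All inputs here are illustrative assumptions.
    """
    return batch_size * max_tokens * embedding_dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical numbers: halving the batch size halves the estimate.
    for batch_size in (32, 16, 8):
        size = embedding_batch_bytes(batch_size, max_tokens=512, embedding_dim=2048)
        print(f"batch_size={batch_size:>2}: {size / 2**20:.0f} MiB")
```

The estimate is linear in every factor, so a smaller batch, shorter inputs or a narrower embedding each reduce it proportionally.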

Answer 2

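OK, so after two days of continuous debugging I was able to find the root cause. What I understood is that Flair does not put any limit on sentence length in terms of word count; it takes the length of the longest sentence as the maximum. That caused the problem, because in my case a few pieces of content were 150,000 lines long, which is too much for their embeddings to be loaded into memory, even on a 16 GB GPU. That is where it broke.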
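
A small sketch of the effect, assuming (as described above) that every item in a mini-batch is padded to the length of the longest one. The helper names and the token counts are made up for illustration; this is not Flair code.

```python
def padded_slots(token_counts):
    """Token slots allocated when every item is padded to the longest one."""
    return len(token_counts) * max(token_counts)


def used_slots(token_counts):
    """Token slots that actually hold real tokens."""
    return sum(token_counts)


if __name__ == "__main__":
    # Hypothetical batches: three short texts, then the same plus one huge text.
    short_batch = [50, 80, 120]
    mixed_batch = [50, 80, 120, 150_000]
    for name, batch in (("short", short_batch), ("mixed", mixed_batch)):
        print(name, "padded:", padded_slots(batch), "used:", used_slots(batch))
```

A single very long record makes the whole batch as expensive as if every record were that long, which is why a handful of outliers is enough to exhaust the memory.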

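To solve this: for content with such a large number of words, you can take a chunk of n words (10K in my case) from any part of the content (left, right or middle) and discard the rest, or simply ignore those records for training if there are very few of them compared with the rest of the data.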
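
A minimal sketch of that truncation step in plain Python, applied to the text before it is written to the training CSV. The function name `truncate_words` and the `keep` parameter are hypothetical, and splitting on whitespace is an assumption about how words are counted.

```python
def truncate_words(text, max_words, keep="left"):
    """Keep at most max_words whitespace-separated words from text.

    keep selects which part survives: "left", "right" or "middle".
    Texts that are already short enough are returned unchanged.
    """
    words = text.split()
    if len(words) <= max_words:
        return text
    if keep == "left":
        start = 0
    elif keep == "right":
        start = len(words) - max_words
    elif keep == "middle":
        start = (len(words) - max_words) // 2
    else:
        raise ValueError("keep must be 'left', 'right' or 'middle'")
    return " ".join(words[start:start + max_words])


if __name__ == "__main__":
    doc = " ".join(f"w{i}" for i in range(10))
    print(truncate_words(doc, 4, keep="left"))
    print(truncate_words(doc, 4, keep="right"))
    print(truncate_words(doc, 4, keep="middle"))
```

With a limit such as 10,000 words, only the few outlier records are affected and everything else passes through untouched.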

After this, I hope you will be able to make progress with your training, as I did.

P.S.: If you are following this thread and run into a similar issue, feel free to leave a comment so that we can help you resolve it.

PyTorch CUDA out of memory error during training
