I have a Django project running on a RedHat 7.7 server; it serves as a service for training spaCy models.
(Python v2.7.5, spaCy v2.0.18, Django v1.11.29, djangorestframework v3.9.4, django-apscheduler v0.3.0)
After the training job has run for 3-10 iterations, RedHat kills the entire Django process. The code:
import os
import pickle
import sys

import spacy


def train_spacy_model(model_id, storage_path, file_id, call_back_url):
    status = "fail"
    try:
        # Resume from a previously pickled model if one exists,
        # otherwise start from the pretrained pipeline.
        model_path = os.path.join(storage_path, "spaCy_model_" + model_id + ".pkl")
        if os.path.isfile(model_path):
            with open(model_path, "rb") as pickle_file:
                nlp = pickle.load(pickle_file)
        else:
            nlp = spacy.load("en_core_web_sm")

        # Load the pickled training data: a list of
        # (text, {"entities": [(start, end, label), ...]}) pairs.
        with open(os.path.join(storage_path, "dataset_" + model_id + ".pkl"), "rb") as pickle_file:
            training_data = pickle.load(pickle_file)

        # Register every entity label found in the training data with the NER component.
        ner = nlp.get_pipe('ner')
        for _, annotations in training_data:
            for ent in annotations.get('entities'):
                ner.add_label(ent[2])

        # Train only the NER component; disable everything else.
        other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
        with nlp.disable_pipes(*other_pipes):
            optimizer = nlp.begin_training()
            for i in range(100):
                print("Starting iteration " + str(i))
                print(sys.getrefcount(nlp))
                losses = {}
                for text, annotations in training_data:
                    nlp.update(
                        [text],
                        [annotations],
                        drop=0.2,
                        sgd=optimizer,
                        losses=losses
                    )
                print(losses)

        # Persist the trained model and remove the consumed dataset.
        with open(model_path, "wb") as f:
            pickle.dump(nlp, f)
        status = "success"
        os.remove(os.path.join(storage_path, "dataset_" + model_id + ".pkl"))
    except Exception as e:
        print(e)
    # Report the outcome to the caller (train_call_back is defined elsewhere in the project).
    train_call_back(model_id, file_id, call_back_url, status)
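To quantify how much memory each iteration actually adds, I can also log the process's peak RSS from inside the training loop. A minimal sketch using only the standard-library resource module (the log_rss helper below is something I added for illustration; it is not part of the project):

import resource

def log_rss(tag):
    # On Linux, ru_maxrss is the peak resident set size in kilobytes.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("%s: peak RSS %.1f MiB" % (tag, peak_kb / 1024.0))

# Called once per iteration, e.g. log_rss("iteration " + str(i)),
# this shows whether resident memory grows monotonically across iterations.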
The error in the RedHat log (/var/log/messages) is:
Jun 15 11:12:26 AZSG-D-CTBT03 kernel: Out of memory: Kill process 20901 (python) score 618 or sacrifice child
Jun 15 11:12:26 AZSG-D-CTBT03 kernel: Killed process 20901 (python) total-vm:24366028kB, anon-rss:22261284kB, file-rss:0kB, shmem-rss:0kB
The server has 32 GB of RAM, of which roughly 19 GB is free on average. Watching with "top", the process climbs to about 60-65% memory usage before being killed; 60-65% of 32 GB is roughly 19-21 GB, and the anon-rss of 22261284 kB in the log is about 21 GiB, so the process is consuming essentially all of the available RAM before it dies.
What confuses me most is that the same code is not killed when it runs as a plain script on the same machine. If the code were leaking memory, shouldn't it also fail to finish when run as a plain script?
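For the plain-script comparison, I call the same function directly from a small driver, roughly like this (the module path, IDs, storage path, and callback URL below are placeholders, not my real values):

# run_training.py - standalone driver for the same training function
from myapp.training import train_spacy_model  # hypothetical module path

if __name__ == "__main__":
    train_spacy_model(
        model_id="123",                       # placeholder
        storage_path="/data/spacy_models",    # placeholder
        file_id="456",                        # placeholder
        call_back_url="http://localhost/cb",  # placeholder
    )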