I have a Django project running on a RedHat 7.7 server; it serves as a service for training spaCy models.
(Python v2.7.5, spaCy v2.0.18, Django v1.11.29, djangorestframework v3.9.4, django-apscheduler v0.3.0)
After the training job has run for 3-10 iterations, RedHat kills the entire Django process. The code:
import os
import pickle
import sys

import spacy


def train_spacy_model(model_id, storage_path, file_id, call_back_url):
    status = "fail"
    try:
        # Resume from a previously pickled model if one exists,
        # otherwise start from the pretrained pipeline.
        model_path = os.path.join(storage_path, "spaCy_model_" + model_id + ".pkl")
        if os.path.isfile(model_path):
            with open(model_path, "rb") as pickle_file:
                nlp = pickle.load(pickle_file)
        else:
            nlp = spacy.load("en_core_web_sm")

        # Load the pickled training data: a list of
        # (text, {"entities": [(start, end, label), ...]}) pairs.
        with open(os.path.join(storage_path, "dataset_" + model_id + ".pkl"), "rb") as pickle_file:
            training_data = pickle.load(pickle_file)

        # Register every entity label found in the training data with the NER component.
        ner = nlp.get_pipe('ner')
        for _, annotations in training_data:
            for ent in annotations.get('entities'):
                ner.add_label(ent[2])

        # Train only the NER component; disable everything else.
        other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
        with nlp.disable_pipes(*other_pipes):
            optimizer = nlp.begin_training()
            for i in range(100):
                print("Starting iteration " + str(i))
                print(sys.getrefcount(nlp))
                losses = {}
                for text, annotations in training_data:
                    nlp.update(
                        [text],
                        [annotations],
                        drop=0.2,
                        sgd=optimizer,
                        losses=losses
                    )
                print(losses)

        # Persist the trained model and remove the consumed dataset.
        with open(model_path, "wb") as f:
            pickle.dump(nlp, f)
        status = "success"
        os.remove(os.path.join(storage_path, "dataset_" + model_id + ".pkl"))
    except Exception as e:
        print(e)
    # Report the outcome to the caller (train_call_back is defined elsewhere in the project).
    train_call_back(model_id, file_id, call_back_url, status)
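To quantify how much memory each iteration actually adds, I can also log the process's peak RSS from inside the training loop. A minimal sketch using only the standard-library resource module (the log_rss helper below is something I added for illustration; it is not part of the project):

import resource

def log_rss(tag):
    # On Linux, ru_maxrss is the peak resident set size in kilobytes.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("%s: peak RSS %.1f MiB" % (tag, peak_kb / 1024.0))

# Called once per iteration, e.g. log_rss("iteration " + str(i)),
# this shows whether resident memory grows monotonically across iterations.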
The error in the RedHat log (/var/log/messages) is:
Jun 15 11:12:26 AZSG-D-CTBT03 kernel: Out of memory: Kill process 20901 (python) score 618 or sacrifice child
Jun 15 11:12:26 AZSG-D-CTBT03 kernel: Killed process 20901 (python) total-vm:24366028kB, anon-rss:22261284kB, file-rss:0kB, shmem-rss:0kB
The server has 32 GB of RAM, of which roughly 19 GB is free on average. Watching with "top", the process climbs to about 60-65% memory usage before being killed; 60-65% of 32 GB is roughly 19-21 GB, and the anon-rss of 22261284 kB in the log is about 21 GiB, so the process is consuming essentially all of the available RAM before it dies.
What confuses me most is that the same code is not killed when it runs as a plain script on the same machine. If the code were leaking memory, shouldn't it also fail to finish when run as a plain script?
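For the plain-script comparison, I call the same function directly from a small driver, roughly like this (the module path, IDs, storage path, and callback URL below are placeholders, not my real values):

# run_training.py - standalone driver for the same training function
from myapp.training import train_spacy_model  # hypothetical module path

if __name__ == "__main__":
    train_spacy_model(
        model_id="123",                       # placeholder
        storage_path="/data/spacy_models",    # placeholder
        file_id="456",                        # placeholder
        call_back_url="http://localhost/cb",  # placeholder
    )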