Question

I have been trying out the TensorFlow tutorial scripts on Google Cloud ML. In particular, I used the cifar10 CNN tutorial script at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10.

When I run this training script in Google Cloud ML, it leaks memory at a rate of about 0.5% per hour.
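For anyone who wants to put a number on a leak like this, here is a minimal sketch of how a per-hour growth rate could be estimated from inside the training process. The helper names `current_peak_rss_kb` and `growth_percent_per_hour` are hypothetical and not part of the tutorial script. It assumes a Unix host, where the standard `resource` module is available.

```python
import resource


def current_peak_rss_kb():
    """Peak resident set size of this process.

    On Linux ru_maxrss is reported in kilobytes (on macOS it is bytes).
    It is a peak value, which is adequate for tracking a steady leak.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def growth_percent_per_hour(samples):
    """Estimate memory growth from a list of (seconds, memory) samples.

    Only the first and last samples are used; the result is the growth
    relative to the first sample, expressed in percent per hour.
    """
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    hours = (t1 - t0) / 3600.0
    if hours <= 0 or m0 <= 0:
        raise ValueError("need two samples with increasing time and positive memory")
    return (m1 - m0) / m0 * 100.0 / hours
```

In a training loop one could append `(time.time(), current_peak_rss_kb())` to a list every few hundred steps and log `growth_percent_per_hour(samples)` once enough time has passed.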

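I have not made any changes to the script other than packaging it into the required GCP format (as described at https://cloud.google.com/ml-engine/docs/how-tos/packaging-trainer) and setting the data location to the storage bucket containing the .bin data files.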

If I run locally, i.e. not in Google Cloud, and use TCMALLOC by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", the memory leak goes away. However, Google Cloud ML does not offer this option.
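As an illustration of the local workaround only, the following sketch starts a Python process with `LD_PRELOAD` pointing at tcmalloc. The helper names `env_with_preload` and `run_training_locally` are hypothetical, and the library path is the one quoted above, which may differ on other systems. This does not apply to Google Cloud ML.

```python
import os
import subprocess
import sys

# Path taken from the question; adjust for the local system.
TCMALLOC_PATH = "/usr/lib/libtcmalloc.so"


def env_with_preload(library_path, base_env=None):
    """Return a copy of the environment with library_path added to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    # Keep any library that was already preloaded, with ours listed first.
    env["LD_PRELOAD"] = library_path if not existing else library_path + ":" + existing
    return env


def run_training_locally(argv):
    """Run the current Python interpreter with argv under tcmalloc; returns the exit code."""
    return subprocess.call([sys.executable] + list(argv),
                           env=env_with_preload(TCMALLOC_PATH))
```

For example, `run_training_locally(["-m", "trainer.cifar10_multi_gpu_train", "--num_gpus=4"])` would launch the module named in the gcloud command, assuming the package is importable from the current directory.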

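What could be causing the leak, and what can I do to fix it? Why have other users not noticed the same problem? Although the leak is small, it is enough to make my training sessions run out of memory and fail when I run against my own data for several days. The leak occurs regardless of how many GPUs I use.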

The gcloud command I used is:

gcloud ml-engine jobs submit training cifar10_job --job-dir gs://tfoutput/joboutput --package-path trainer --module-name=trainer.cifar10_multi_gpu_train --region europe-west1 --staging-bucket gs://tfoutput --scale-tier CUSTOM --config config.yml --runtime-version 1.0 -- --num_gpus=4

The configuration file (config.yml) is:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu

Any help is appreciated, thanks.

Answer 1

We recommend using this version of the code:

github.com/tensorflow/models/pull/1538

It has performance benefits (because the run time is shorter, you are less likely to hit an OOM).

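Of course, this may not be a permanent fix. However, according to our tests, TensorFlow 1.2 appears to resolve the issue. TensorFlow 1.2 will be available on CloudML Engine soon. If you still have problems, please let us know.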

Memory leak in TensorFlow Google Cloud ML training
