Question

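I am running some TensorFlow code that can restore from a checkpoint and resume training. Whenever I restore with the CPU build, it works perfectly fine. But if I try to restore while running the code on a GPU, it does not work. Specifically, I get this error: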

Traceback (most recent call last):
  File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module>
    large_main_hp.main_large_hp_ckpt(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt
    run_hyperparam_search(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search
    main_hp.main_hp(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp
    with tf.Session(graph=graph) as sess:
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615

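I see that it says I ran out of memory, but when I increased the memory to 10GB, nothing really changed. This only happens with my GPU build; the CPU one restores perfectly fine.

Does anyone have any ideas about what might be causing this?

The GPU is basically allocated automatically, so I'm not sure what could be causing this, or even what the first steps to debug it would be.

Full error: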

E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615
Traceback (most recent call last):
  File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module>
    large_main_hp.main_large_hp_ckpt(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt
    run_hyperparam_search(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search
    main_hp.main_hp(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp
    with tf.Session(graph=graph) as sess:
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Answer 1

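TensorFlow on CPU takes advantage of both physical and virtual memory, which gives you almost unlimited memory to work with your model. The first step in debugging would be to build a smaller model, by simply removing some weights/layers, and run it on the GPU to make sure you have no coding errors. Then slowly increase the layers/weights until you run out of memory again. That would confirm that you have a memory problem on the GPU. I would recommend initially building a graph that you know will fit on the GPU once you train it. If you need the large graph, consider assigning parts of the graph to different GPUs, if you have them.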
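To pick the sizes for the "start small, then grow" experiment, it can help to estimate how much memory the parameters alone would need. The following is only a rough sketch, assuming a plain fully connected network stored as float32; the function names and layer sizes are hypothetical and not taken from the question's code. It ignores activations, gradients, optimizer state, and the memory the CUDA context itself uses, so treat the result as a lower bound.

```python
# Rough, hypothetical estimate of parameter memory for a fully connected network.
# Ignores activations, gradients, optimizer state, and CUDA context overhead,
# so the real GPU footprint will be larger than what this reports.

BYTES_PER_FLOAT32 = 4


def dense_param_count(layer_sizes):
    """Count weights and biases for consecutive dense layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


def param_megabytes(layer_sizes):
    """Parameter memory in MiB, assuming float32 storage."""
    return dense_param_count(layer_sizes) * BYTES_PER_FLOAT32 / (1024 ** 2)


if __name__ == "__main__":
    # Grow the hidden layers step by step and watch the estimate.
    for hidden in (64, 256, 1024, 4096):
        sizes = [784, hidden, hidden, 10]  # assumed input/output sizes
        print(hidden, dense_param_count(sizes), round(param_megabytes(sizes), 2))
```

If even the smallest configuration fails on the GPU while its estimate is tiny, the model size is probably not what is exhausting the memory.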

Why does TensorFlow run out of memory when restoring a checkpoint, but not when running the original script?
