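I am using TensorFlow's tf.estimator.train_and_evaluate together with Google's Cloud AI job system to train a tf.estimator.Estimator model.
Recently, my training runs started failing with CUDA_ERROR_OUT_OF_MEMORY errors, but I noticed that this only happens during the evaluation phase. That is, I can train for any number of steps, but as soon as the training phase ends, I see the error.
I have copied and pasted the exact errors below (several occur in a row):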
failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
could not allocate pinned host memory of size: 8589934592
failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
could not allocate pinned host memory of size: 7730940928
failed to alloc 6957846528 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
could not allocate pinned host memory of size: 6957846528
failed to alloc 6262061568 bytes on host: CUDA_ERROR_INVALID_VALUE: invalid argument
could not allocate pinned host memory of size: 6262061568
failed to alloc 5635855360 bytes on host: CUDA_ERROR_INVALID_VALUE: invalid argument
could not allocate pinned host memory of size: 5635855360
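
To make the sizes in these messages easier to read, here is a minimal sketch that extracts the requested byte counts from the log and converts them to GiB. It is not part of my job code; the names LOG, requested_sizes and to_gib are made up for illustration, and only the "failed to alloc" lines are included since the "could not allocate" lines repeat the same numbers.

```python
import re

# The "failed to alloc" messages from the log above, one per line.
LOG = """\
failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
failed to alloc 6957846528 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
failed to alloc 6262061568 bytes on host: CUDA_ERROR_INVALID_VALUE: invalid argument
failed to alloc 5635855360 bytes on host: CUDA_ERROR_INVALID_VALUE: invalid argument
"""


def requested_sizes(log):
    """Return the byte count of every 'failed to alloc N bytes on host' message."""
    return [int(n) for n in re.findall(r"failed to alloc (\d+) bytes on host", log)]


def to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


sizes = requested_sizes(LOG)
for size in sizes:
    print(f"{size} bytes = {to_gib(size):.2f} GiB")
```

The first request is exactly 8 GiB, and each retry asks for a smaller amount. The messages say "on host" and "pinned host memory", which suggests the failing allocation is host RAM requested through CUDA rather than GPU device memory.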