I want my application to automatically find the largest batch size that fits, so I set up a binary search that catches OOM exceptions, like this:
import tensorflow as tf

def find_max_batch_size():
    min_batch_size = 3072
    max_batch_size = 16384
    min_range = 512  # Exit when the search range is smaller than this.
    sample_iterations = 5  # Training steps to run for each candidate size.

    while max_batch_size - min_batch_size > min_range:
        batch_size = (max_batch_size + min_batch_size) // 2
        with tf.Graph().as_default() as graph:
            # build_graph() builds the training graph for the candidate batch
            # size and returns the training op (defined elsewhere in the app).
            train_op = build_graph(batch_size)
            try:
                with tf.Session(graph=graph) as sess:
                    sess.run(tf.global_variables_initializer())
                    sess.run(tf.tables_initializer())
                    for _ in range(sample_iterations):
                        sess.run(train_op)
            except tf.errors.ResourceExhaustedError:
                # Candidate did not fit on the GPU: search smaller sizes.
                max_batch_size = batch_size
            else:
                # Candidate fit: search larger sizes.
                min_batch_size = batch_size

    return min_batch_size
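For context, build_graph(batch_size) stands in for my application's model construction. A minimal hypothetical stub along these lines (a single dense layer on random features; the sizes are arbitrary and not meant to reproduce the OOM) would be enough to drive the search loop end-to-end:

import tensorflow as tf

def build_graph(batch_size, feature_size=512, hidden_size=2048):
    # Hypothetical stand-in for the real model: one hidden dense layer trained
    # on random features, so memory use grows with the candidate batch size.
    features = tf.random_normal([batch_size, feature_size])
    labels = tf.random_normal([batch_size, 1])
    hidden = tf.layers.dense(features, hidden_size, activation=tf.nn.relu)
    predictions = tf.layers.dense(hidden, 1)
    loss = tf.losses.mean_squared_error(labels=labels, predictions=predictions)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    return optimizer.minimize(loss)

With such a stub in place, find_max_batch_size() runs the same build / run / catch cycle on every iteration of the search.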
Using the official TensorFlow 1.12 Docker image, the search starts converging toward the actual maximum batch size, discarding sizes that are too large to fit on the current hardware. However, on the 6th iteration it fails with this internal error:
File "opennmt/runner.py", line 252, in _auto_tune_batch_size
with tf.Session(graph=graph, config=session_config) as sess:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1551, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 676, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: an illegal memory access was encountered
Given the above procedure, what is causing this error? Is it generally unsafe to recover from OOM errors? And how could I work around it to achieve my original goal inside the application, rather than by running an external script?

Thanks!