Question

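I implemented a network in TensorFlow and train it on 4 GPUs. When I press Ctrl+C, the program crashes the NVIDIA driver and leaves behind a zombie process named "python". I cannot kill the zombie process, and I cannot reboot the Ubuntu system with sudo reboot either.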

I am using a FIFO queue and a thread to read data from binary files.

import threading
import tensorflow as tf

coord = tf.train.Coordinator()
t = threading.Thread(target=load_and_enqueue, args=(sess, enqueue_op, coord))
t.start()
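The body of load_and_enqueue is not shown, so the following is only a minimal sketch of one way such a thread can stay stoppable. It uses the standard library as a stand-in: queue.Queue plays the role of the FIFO queue and threading.Event plays the role of the coordinator (stop_event.set() corresponds to coord.request_stop(), stop_event.is_set() to coord.should_stop()). The records argument and run_demo are hypothetical names introduced for illustration. The idea is that the producer never blocks indefinitely on a full queue, so it can always notice a stop request.

```python
import queue
import threading


def load_and_enqueue(q, stop_event, records):
    """Producer: push records until asked to stop (stand-in for the enqueue thread)."""
    for record in records:
        while not stop_event.is_set():
            try:
                q.put(record, timeout=0.1)  # timed put, never blocks forever
                break
            except queue.Full:
                continue  # queue is full: re-check the stop flag, then retry
        if stop_event.is_set():
            return


def run_demo():
    q = queue.Queue(maxsize=2)
    stop_event = threading.Event()
    t = threading.Thread(target=load_and_enqueue, args=(q, stop_event, range(100)))
    t.start()
    first = q.get()      # consume one item, as a training step would
    stop_event.set()     # analogous to coord.request_stop()
    t.join(timeout=5)    # analogous to coord.join([t])
    return first, t.is_alive()
```

With a blocking put instead of the timed one, the thread could sit inside the enqueue call forever once the consumer stops reading, and the join would never return.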

After I call sess.close(), the program does not stop, and I see:

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=4033 evicted_count=3000 eviction_rate=0.743863 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=14033 evicted_count=13000 eviction_rate=0.926388 and unsatisfied allocation rate=0

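It seems the GPU resources are not released. If I open another terminal, the nvidia-smi command does not work. I then have to force a restart of the system with: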

#echo 1 > /proc/sys/kernel/sysrq
#echo b > /proc/sysrq-trigger

I know sess.close() may be too brutal, so I tried to empty the FIFO queue with a dequeue operation first, and then:

while iteration < 10000:
  # GPU training...
  iteration += 1

#training finished

coord.request_stop()
while sess.run(queue_size) > 0:
  sess.run(dequeue_one_element_op)
  print('queue_size='+str(sess.run(get_queue_size_op)))
  time.sleep(1)
coord.join([t])
print('finished join t')

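This approach does not work either. Basically, the program cannot terminate after reaching the maximum number of training iterations.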
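One possible reason a drain loop like this fails (an assumption, since load_and_enqueue is not shown) is that "queue is empty" and "producer has exited" are different conditions: the producer may refill the queue or still be blocked in an enqueue call. The sketch below uses only the standard library as a stand-in for the TensorFlow queue and coordinator, and drains until the thread has actually exited, with a deadline so shutdown cannot hang. The names producer, shutdown, and train are hypothetical. In TensorFlow 1.x, the queue-side counterpart would be running queue.close(cancel_pending_enqueues=True), which cancels blocked enqueue calls instead of draining around them.

```python
import queue
import threading
import time


def producer(q, stop_event):
    """Hypothetical producer: blocking put, stop flag checked between items."""
    i = 0
    while not stop_event.is_set():
        q.put(i)  # blocks while the queue is full
        i += 1


def shutdown(q, stop_event, t, timeout=5.0):
    """Request stop, then drain until the producer has exited or the deadline passes."""
    stop_event.set()
    deadline = time.monotonic() + timeout
    while t.is_alive() and time.monotonic() < deadline:
        try:
            q.get(timeout=0.05)  # free one slot so a blocked put can return
        except queue.Empty:
            pass
    t.join(timeout=0.1)
    return not t.is_alive()


def train(max_iterations=3):
    q = queue.Queue(maxsize=4)
    stop_event = threading.Event()
    t = threading.Thread(target=producer, args=(q, stop_event), daemon=True)
    t.start()
    try:
        for _ in range(max_iterations):
            q.get()  # stand-in for one training step
    except KeyboardInterrupt:
        pass  # Ctrl+C: fall through to the same cleanup path
    finally:
        clean = shutdown(q, stop_event, t)
    return clean
```

Putting the cleanup in a finally block means the same shutdown sequence runs whether training finishes normally or is interrupted with Ctrl+C.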

Answer 1

https://github.com/tensorflow/tensorflow/issues/658

This solved the problem (note that it makes only GPU 0 visible to the process):

export CUDA_VISIBLE_DEVICES=0
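The same variable can also be set from inside the script, as long as this happens before the CUDA runtime is initialized, which in practice means before TensorFlow is imported. A minimal sketch follows; restrict_visible_gpus is a hypothetical helper name, not part of TensorFlow.

```python
import os


def restrict_visible_gpus(device_ids):
    """Set CUDA_VISIBLE_DEVICES to the given GPU indices and return the value used."""
    value = ",".join(str(d) for d in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


restrict_visible_gpus([0])
# import tensorflow as tf  # import only after the variable has been set
```

Passing several indices, for example restrict_visible_gpus([0, 1]), exposes more than one device; whether the original crash reappears with multiple visible GPUs is not something the answer addresses.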

How to safely terminate a TensorFlow program running on multiple GPUs
