Question

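I am applying machine learning to MNIST with TensorFlow. I do this on a cluster in which each node runs a distributed execution of TensorFlow. I launch the individual executions from a bash script on a master node. The master node connects over ssh to a set of nodes in the cluster and then runs the Python script that runs TensorFlow.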

While TensorFlow is running on a node, I frequently get the following error message, which crashes the node:

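This is caused by running out of memory. When I log in to a node to check its memory, I find that very little is available. The problem is that the memory on the node is not released when the run on that node finishes (or when it is killed by the master bash script after a timeout).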

Is there a simple way to clean up a node's memory after the TensorFlow application exits? I do not have sudo privileges.

Answer 1

Have you tried closing the session? I believe it is sess.close().
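
A minimal sketch of this suggestion, assuming the TensorFlow 1.x session API (the question does not say which version is in use). The helper run_and_close and its parameters are hypothetical names introduced here for illustration; all it does is guarantee that close() is called, even if the training code raises.

```python
from contextlib import closing


def run_and_close(make_session, work):
    """Call work(sess) and always close the session, even if work raises.

    make_session: zero-argument callable returning an object with .close()
                  (hypothetical parameter; with TensorFlow 1.x this could be tf.Session).
    work:         callable that takes the session, e.g. the training loop.
    """
    with closing(make_session()) as sess:
        return work(sess)


# Assumed TensorFlow 1.x usage (not executed here; train_op is a placeholder name):
#   import tensorflow as tf
#   run_and_close(tf.Session, lambda sess: sess.run(train_op))
```

In TensorFlow 1.x, tf.Session can also be used directly in a with statement, which closes it on exit. Closing the session may not be enough if the Python process itself is left running on the node.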

Answer 2

I took some inspiration from an example from AWS (https://aws.amazon.com/blogs/compute/distributed-deep-learning-made-easy/).

Running pkill -f python on each worker and parameter server host before running the Python script solved the problem.
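
The cleanup step could be sketched as follows, assuming the launcher can reach each host with passwordless ssh, as the master script in the question does. The function names, the runner parameter, and the host names are hypothetical; only ssh and pkill -f come from the thread.

```python
import shlex
import subprocess


def build_cleanup_command(host, pattern="python"):
    """Return the ssh command that kills leftover processes on one host."""
    # pkill -f matches the pattern against the full command line.
    return ["ssh", host, "pkill -f " + shlex.quote(pattern)]


def cleanup_hosts(hosts, runner=subprocess.run):
    """Run the cleanup command on every host and return {host: exit code}."""
    results = {}
    for host in hosts:
        completed = runner(build_cleanup_command(host))
        # pkill exits with 1 when no process matched, which is fine here.
        results[host] = completed.returncode
    return results


# Hypothetical usage before starting a new run:
#   cleanup_hosts(["worker1", "worker2", "ps1"])
```

Without sudo, pkill can only signal processes owned by the same user, which is what is needed here. Note that it stops every matching Python process of that user on the host, so it should not be pointed at the machine running the launcher if the launcher is itself a Python program.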

Out of memory after exiting distributed TensorFlow execution
