Why does TensorFlow 1.1 get slower and slower during training? Is it a memory leak or queue starvation?

Date: 2017-04-06 08:55:51

Tags: memory-leaks tensorflow

I am training ESPCN with TensorFlow 1.1, and the computation time per patch grows almost linearly during training. The first 100 epochs take only 4-5 seconds, but by the 70th block of 100 epochs the same amount of work takes about half a minute. See the training results below:

[screenshots of the training log omitted]

I searched for the same problem on Google and Stack Overflow and tried the solutions below, but none of them seemed to work:

1. calling tf.reset_default_graph() after sess.run();
2. adding time.sleep(5) to prevent queue starvation.

This is my training loop:

L3, var_w_list, var_b_list = model_train(IN, FLAGS)
cost = tf.reduce_mean(tf.reduce_sum(tf.square(OUT - L3), reduction_indices=0))
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(FLAGS.base_lr, global_step * FLAGS.batch_size,
                                           FLAGS.decay_step, 0.96, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost, global_step=global_step,
                                                           var_list=var_w_list + var_b_list)
# optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(cost, var_list=var_w_list + var_b_list)
cnt = 0

with tf.Session() as sess:
    init_op = tf.initialize_all_variables()
    sess.run(init_op)
    saver = tf.train.Saver()
    ckpt = tf.train.get_checkpoint_state(FLAGS.checkpoint_dir)
    print('\n\n\n =========== All initialization finished, now training begins ===========\n\n\n')
    t_start = time.time()
    t1 = t_start
    for i in range(1, FLAGS.max_Epoch + 1):
        LR_batch, HR_batch = batch.__next__()
        global_step += 1
        [_, cost1] = sess.run([optimizer, cost], feed_dict={IN: LR_batch, OUT: HR_batch})
        # tf.reset_default_graph()
        if i % 100 == 0 or i == 1:
            print_step = i
            print_loss = cost1 / FLAGS.batch_size
            test_LR_batch, test_HR_batch = test_batch.__next__()
            test_SR_batch = test_HR_batch.copy()
            test_SR_batch[:, :, :, 0:3] = sess.run(L3, feed_dict={IN: test_LR_batch[:, :, :, 0:3]})
            # tf.reset_default_graph()
            psnr_tmp = 0.0
            ssim_tmp = 0.0
            for k in range(test_SR_batch.shape[0]):
                com1 = test_SR_batch[k, :, :, 0]
                com2 = test_HR_batch[k, :, :, 0]
                psnr_tmp += get_psnr(com1, com2, FLAGS.HR_size, FLAGS.HR_size)
                ssim_tmp += get_ssim(com1, com2, FLAGS.HR_size, FLAGS.HR_size)
            psnr[cnt] = psnr_tmp / test_SR_batch.shape[0]
            ssim[cnt] = ssim_tmp / test_SR_batch.shape[0]
            ep[cnt] = print_step
            t2 = time.time()
            print_time = t2 - t1
            t1 = t2
            print(("[Epoch] : {0:d} [Current cost] : {1:5.8f} \t [Validation PSNR] : {2:5.8f} \t [Duration time] : {3:10.8f} s \n").format(print_step, print_loss, psnr[cnt], print_time))
            # tf.reset_default_graph()
            cnt += 1
        if i % 1000 == 0:
            L3_test = model_test(IN_TEST, var_w_list, var_b_list, FLAGS)
            output_img = single_HR.copy()
            output_img[:, :, :, 0:3] = sess.run(L3_test, feed_dict={IN_TEST: single_LR[:, :, :, 0:3]})
            tf.reset_default_graph()
            subname = FLAGS.img_save_dir + '/' + str(i) + ".jpg"
            img_gen(output_img[0, :, :, :], subname)
            print(('================= Saving model to {}/model.ckpt ================= \n').format(FLAGS.checkpoint_dir))
            time.sleep(5)
            # saver.save(sess, FLAGS.checkpoint_dir + '/model.ckpt', print_step)
    t_tmp = time.time() - t_start
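The get_psnr helper used in the loop above is not shown in the post. For reference, a typical PSNR computation looks like the sketch below; the exact definition here (images normalized to [0, 1], so the peak value is 1.0) is my assumption, not the poster's code:

```python
import numpy as np

def get_psnr(img1, img2, height, width):
    # PSNR in dB for images with values in [0, 1]; peak value is 1.0.
    diff = img1.astype(np.float64) - img2.astype(np.float64)
    mse = np.sum(diff ** 2) / (height * width)
    return 10.0 * np.log10(1.0 / mse)
```

For example, two 2x2 images that differ by 0.1 everywhere have an MSE of 0.01 and therefore a PSNR of 20 dB.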

I know the general idea, which is to reduce the operations in the Session(). But how? Does anyone have a solution?

Here is part of my code:

def _phase_shift(I, r):
    bsize, a, b, c = I.get_shape().as_list()
    bsize = tf.shape(I)[0] # Handling Dimension(None) type for undefined batch dim
    X = tf.reshape(I, (bsize, a, b, r, r))
    X = tf.transpose(X, (0, 1, 2, 4, 3))  # bsize, a, b, 1, 1
    X = tf.split(X, a, 1)  # a, [bsize, b, r, r]
    X = tf.concat([tf.squeeze(x, axis=1) for x in X], 2)  # bsize, b, a*r, r
    X = tf.split(X, b, 1)  # b, [bsize, a*r, r]
    X = tf.concat([tf.squeeze(x, axis=1) for x in X], 2)  # bsize, a*r, b*r
    return tf.reshape(X, (bsize, a*r, b*r, 1))

def PS(X, r, color=False):
    if color:
        Xc = tf.split(X, 3, 3)
        X = tf.concat([_phase_shift(x, r) for x in Xc], 3)
    else:
        X = _phase_shift(X, r)
    return X
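As a sanity check on the copied code, here is a NumPy re-implementation (my own sketch, not from the post) of the rearrangement that _phase_shift performs. Tracing the split/concat ordering gives output[n, i*r+s, j*r+t] == input[n, i, j, t*r+s]:

```python
import numpy as np

def phase_shift_np(I, r):
    # NumPy sketch of _phase_shift: (bsize, a, b, r*r) -> (bsize, a*r, b*r, 1).
    bsize, a, b, c = I.shape
    X = I.reshape(bsize, a, b, r, r)   # split channel c into (t, s) with c == t*r + s
    X = X.transpose(0, 1, 4, 2, 3)     # -> (bsize, a, s, b, t)
    return X.reshape(bsize, a * r, b * r, 1)
```

For r = 2 and a single 1x1 "pixel" with channels [0, 1, 2, 3], this produces the 2x2 block [[0, 2], [1, 3]], which is handy for checking that a replacement layer shuffles sub-pixels in the same order.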

My configuration is: Windows 10 + TensorFlow 1.1 + Python 3.5 + CUDA 8.0 + cuDNN 5.1

================================================================

In addition, I used a pixel-shuffle (PS) layer instead of deconvolution as the last layer. I copied the PS code (the _phase_shift and PS functions above) from someone else's implementation.


Here X is the 4-D image tensor, r is the up-scaling factor, and color determines whether the image channels are 3 (YCbCr format) or 1 (grayscale format).

Using the layer is very simple, for example: L3_ps = PS(L3, scale, True).


Now I wonder whether this layer causes the slowdown, because the program ran well when I used a deconvolution layer instead. Switching back to deconvolution might be a solution, but for some reason I have to use the PS layer.

1 Answer:

Answer 0 (score: 0)

I suspect this line is causing a memory leak (though without seeing all of your code, I can't say for certain):

L3_test = model_test(IN_TEST, var_w_list, var_b_list, FLAGS)

L3_test appears to be a tf.Tensor (since you later pass it to sess.run()), so it seems that model_test() adds new nodes to the graph every time it is called (every 1000 steps), which leads to more and more work over time.

The solution is quite simple: since model_test() does not depend on anything computed inside the training loop, you can move the call outside the training loop, so it is only executed once.
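To make the pattern concrete, here is a tiny, hypothetical stand-in in plain Python (not real TensorFlow; FakeGraph and the node names are invented for illustration) showing why building the test graph once matters:

```python
# Toy stand-in for a TF1 graph: each model_test() call adds nodes,
# mimicking how graph *construction* (not execution) grows over time.
class FakeGraph:
    def __init__(self):
        self.nodes = []

def model_test(graph):
    # hypothetical: every call appends fresh ops to the graph
    node = "test_op_%d" % len(graph.nodes)
    graph.nodes.append(node)
    return node

# Anti-pattern: calling model_test() inside the loop grows the graph.
g_bad = FakeGraph()
for step in range(5):
    out = model_test(g_bad)   # new nodes on every iteration
assert len(g_bad.nodes) == 5  # 5 nodes after 5 iterations

# Fix: build once before the loop, reuse the same output tensor.
g_good = FakeGraph()
out = model_test(g_good)
for step in range(5):
    result = out              # in real code: sess.run(out, feed_dict=...)
assert len(g_good.nodes) == 1  # graph size stays constant
```

In real TF 1.x code you can confirm the same effect with tf.get_default_graph().finalize() after building the model: the graph then raises an exception if anything inside the loop tries to add new nodes.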