GPU load gradually decreases when using TensorFlow queues

Asked: 2017-12-07 00:44:37

Tags: tensorflow queue tensorflow-gpu

I have a large amount of pickled data on my hard disk. I have a generator function that reads these pickle files in batches (batch_size = 512), and I use a TensorFlow queue to speed this process up. Currently my queue_size is 4096 and I use 6 threads, one per physical core. When I run the code and monitor the GPU load (a Titan X), it looks fine at the beginning:
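For context, the generator looks roughly like this (a minimal sketch, not my exact code; the pickle_paths argument and the (images, labels) file layout are assumptions):

import pickle
import numpy as np

def data_generator_training_from_pickles(pickle_paths, batch_size):
    # Sketch (assumed layout): each pickle file holds an (images, labels)
    # tuple of numpy arrays; yield fixed-size batches, cycling forever.
    while True:
        for path in pickle_paths:
            with open(path, 'rb') as f:
                images, labels = pickle.load(f)
            for i in range(0, len(images) - batch_size + 1, batch_size):
                yield (images[i:i + batch_size].astype(np.float32),
                       labels[i:i + batch_size].astype(np.float32))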

[screenshot: GPU load at the beginning of training]

But as time goes on, I see the GPU load decrease:

[screenshot: GPU load decreasing over time]

I also see the execution time per epoch increase:

Epoch 1 | Exec. time: 1646.523872
Epoch 2 | Exec. time: 1760.770192
Epoch 3 | Exec. time: 1861.450039
Epoch 4 | Exec. time: 1952.52812
Epoch 5 | Exec. time: 2167.598431
Epoch 6 | Exec. time: 2278.203603
Epoch 7 | Exec. time: 2320.280606
Epoch 8 | Exec. time: 2467.036160
Epoch 9 | Exec. time: 2584.932837
Epoch 10 | Exec. time: 2736.121618
...
Epoch 20 | Exec. time: 3841.635191

This is consistent with the GPU load I observed.

Now, the question is: why is this happening? Is this a bug in TensorFlow's queues? Am I doing something wrong? I am using TensorFlow 1.4. In case it helps, this is how I define the queue and the enqueue/dequeue operations:

def get_train_queue(batch_size, data_generator, queue_size, num_threads):
    # Build a FIFO queue so that data loading runs in parallel with training.
    q = tf.FIFOQueue(capacity=queue_size,
                     dtypes=[tf.float32, tf.float32, tf.float32, tf.float32],
                     shapes=[[batch_size, x_height, x_width, num_channels],
                             [batch_size, num_classes],
                             [batch_size, latent_size],
                             [batch_size]])

    # One batch of images/labels plus latent noise and the labeled mask.
    batch = next(data_generator)
    batch_z = np.random.uniform(-1.0, 1.0, size=(batch_size, latent_size))
    mask = get_labled_mask(labeled_rate, batch_size)

    # Replicate the enqueue op across num_threads queue-runner threads.
    enqueue_op = q.enqueue((batch[0], batch[1], batch_z, mask))
    qr = tf.train.QueueRunner(q, [enqueue_op] * num_threads)
    tf.train.add_queue_runner(qr)

    return q

def train_per_batch(sess, q, train_samples_count, batch_size, parameters, epoch):
    # Train one epoch batch by batch and accumulate the execution time.
    t_total = 0
    for iteration in range(int(train_samples_count / batch_size)):
        t_start = time.time()
        data = q.dequeue()
        feed_dictionary = {parameters['x']: sess.run(data[0]),
                           parameters['z']: sess.run(data[2]),
                           parameters['label']: sess.run(data[1]),
                           parameters['labeled_mask']: sess.run(data[3]),
                           parameters['dropout_rate']: dropout,
                           parameters['d_init_learning_rate']: D_init_learning_rate,
                           parameters['g_init_learning_rate']: G_init_learning_rate,
                           parameters['is_training']: True}

        sess.run(parameters['D_optimizer'], feed_dict=feed_dictionary)
        sess.run(parameters['G_optimizer'], feed_dict=feed_dictionary)

        train_D_loss = sess.run(parameters['D_L'], feed_dict=feed_dictionary)
        train_G_loss = sess.run(parameters['G_L'], feed_dict=feed_dictionary)
        t_total += (time.time() - t_start)
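For completeness, the queue runners are started before the training loop with the standard TF 1.x Coordinator pattern (a minimal sketch; total_epochs and the surrounding setup are assumed names, not my exact code):

coord = tf.train.Coordinator()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Launch the num_threads enqueue threads registered via add_queue_runner.
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        for epoch in range(total_epochs):  # total_epochs: assumed name
            train_per_batch(sess, q, train_samples_count, batch_size,
                            parameters, epoch)
    finally:
        coord.request_stop()
        coord.join(threads)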

I have also tried tf.data.Dataset.from_generator(), which TensorFlow 1.4 recommends:

train_dataset = tf.data.Dataset.from_generator(data_generator_training_from_pickles,
                                               (tf.float32, tf.float32, tf.float32, tf.float32),
                                               ([batch_size, x_height, x_width, num_channels],
                                                [batch_size, num_classes],
                                                [batch_size, latent_size],
                                                [batch_size]))

and then used:

def train_per_batch(sess, train_dataset, train_samples_count, batch_size, parameters, epoch):
    # Same loop as above, but pulling data from the tf.data dataset.
    t_total = 0
    for iteration in range(int(train_samples_count / batch_size)):
        t_start = time.time()
        data = train_dataset.make_one_shot_iterator().get_next()
        feed_dictionary = {parameters['x']: sess.run(data[0]),
                           parameters['z']: sess.run(data[2]),
                           parameters['label']: sess.run(data[1]),
                           parameters['labeled_mask']: sess.run(data[3]),
                           parameters['dropout_rate']: dropout,
                           parameters['d_init_learning_rate']: D_init_learning_rate,
                           parameters['g_init_learning_rate']: G_init_learning_rate,
                           parameters['is_training']: True}

        sess.run(parameters['D_optimizer'], feed_dict=feed_dictionary)
        sess.run(parameters['G_optimizer'], feed_dict=feed_dictionary)

        train_D_loss = sess.run(parameters['D_L'], feed_dict=feed_dictionary)
        train_G_loss = sess.run(parameters['G_L'], feed_dict=feed_dictionary)
        t_total += (time.time() - t_start)
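For reference, the pattern in the tf.data documentation creates the iterator and its get_next() op once, outside the training loop, and only calls sess.run inside it (a sketch of that pattern under the same names as above, not the code I actually timed):

# Iterator built once; get_next() returns the same tensors every iteration.
iterator = train_dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

for iteration in range(int(train_samples_count / batch_size)):
    x, label, z, labeled_mask = sess.run(next_batch)
    # ... build feed_dictionary from x, label, z, labeled_mask as above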

This was the worst of all. Apparently there is no queuing at all:

[screenshot: GPU load with tf.data.Dataset.from_generator]

0 Answers