Callback allocating GPU memory

Date: 2019-02-11 18:30:55

Tags: python tensorflow memory keras allocation

I wrote a custom callback (BatchHistory) to log model performance per batch rather than per epoch, as the default History callback does.

I store the BatchHistory object as a pickle file so that I can access the exact training history later. However, I observed that

1) the pickle of the callback object is about 10x larger than a pickle of just its logs field, and

2) GPU memory gets allocated when the BatchHistory object is unpickled.

I don't understand why this happens. I looked at the callbacks in the source and they are essentially simple classes with no logical connection to the Keras model. So where does the GPU memory allocation come from, and why is the pickle file so large when it has so little to do with the amount of data actually logged? My guess is that some data from the trained model stays bound to the Callback object and gets pickled along with it, which would explain the large files. Is that so? If so: why, and where in the source is the responsible code?
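For reference, this is roughly how the two pickles from observation 1) are produced (the helper names are just for illustration; cb is a BatchHistory instance after training has finished):

import pickle

def dump_logs_only(cb, path):
    # Small pickle: only the plain-Python logs dict.
    with open(path, 'wb') as f:
        pickle.dump(cb.logs, f)

def dump_whole_callback(cb, path):
    # ~10x larger pickle: the entire callback object.
    with open(path, 'wb') as f:
        pickle.dump(cb, f)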

Here is an OOM error I got when unpickling the callback while the GPU memory was already heavily used:

---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
~/anaconda3/envs/neucores/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1333     try:
-> 1334       return fn(*args)
   1335     except errors.OpError as e:

~/anaconda3/envs/neucores/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1318       return self._call_tf_sessionrun(
-> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
   1320 

~/anaconda3/envs/neucores/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1406         self._session, options, feed_dict, fetch_list, target_list,
-> 1407         run_metadata)
   1408 

ResourceExhaustedError: OOM when allocating tensor with shape[8704,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training/Adam/Variable_30/Assign}} = Assign[T=DT_FLOAT, _grappler_relax_allocator_constraints=true, use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adam/Variable_30, training/Adam/zeros_12)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Here is my callback class. I don't think my own code has anything to do with it, though; it must be something about the base class. But as I said, I cannot find anything in the source that could cause GPU memory allocation.

import time

from keras.callbacks import Callback


class BatchHistory(Callback):

    def __init__(self):
        super().__init__()
        self.logs = {'loss': [],
                     'acc': [],
                     'val_acc': [],
                     'val_loss': [],
                     'epoch_cnt': 0,
                     'epoch_ends': [],
                     'time_elapsed': 0  # seconds
                     }
        self.start_time = time.time()

    def on_train_begin(self, logs={}):
        pass

    def on_batch_end(self, batch, logs={}):
        # Record per-batch metrics instead of per-epoch ones.
        self.logs['acc'].append(logs.get('acc'))
        self.logs['loss'].append(logs.get('loss'))
        self.logs['time_elapsed'] = int(time.time() - self.start_time)

    def on_epoch_end(self, epochs, logs=None):
        # Mark where this epoch ended in the per-batch series.
        self.logs['epoch_cnt'] += 1
        self.logs['epoch_ends'].append(len(self.logs['loss']))
        self.logs['val_acc'].append(logs.get('val_acc'))
        self.logs['val_loss'].append(logs.get('val_loss'))
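If my suspicion is correct, the extra data should be reachable through an attribute the base class attaches during fit (Callback.set_model() sets self.model). A quick way to check, plus a hypothetical workaround that strips those attributes before pickling, would be something like:

# Inspect what the trained callback actually holds:
# print({k: type(v) for k, v in cb.__dict__.items()})

class PicklableBatchHistory(BatchHistory):
    """Hypothetical variant: drop framework-attached references before
    pickling so only plain Python data ends up in the file."""

    def __getstate__(self):
        state = dict(self.__dict__)
        # Callback.set_model() attaches the Keras model during fit; remove
        # it so pickling does not serialize weights/optimizer state.
        state.pop('model', None)
        state.pop('params', None)
        state.pop('validation_data', None)  # attached by some Keras versions
        return state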

0 Answers:

No answers yet.