Question

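I am running TensorFlow in a loop over 300 random structures to find a good network structure. After the first epoch over the data finishes, I remove the worst 10% and start the second epoch on the remaining networks. However, it fails at around iteration 350. I am running on a Tesla K80 with 11.25 GiB of memory. I am using TensorFlow 0.9.0, with aggregation_method = tf.AggregationMethod.EXPERIMENTAL_TREE set on tf.train.MomentumOptimizer. Below is the error I get. (Since the log is long, I have included only its beginning, the point where the details change, and the final lines.)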

I appreciate any help. Afshin

 I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.

  I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (512):       Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.

  I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1024):      Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.

  I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2048):      Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.

  I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4096):      Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.

  I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8192):   
  ......
  ......
   Bin (268435456):         Total Chunks: 0, Chunks in use: 0 0B       allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
  I tensorflow/core/common_runtime/bfc_allocator.cc:656] Bin for 1.8KiB was 1.0KiB, Chunk State: 
  I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303ee0000 of size 24832
  I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303ee6100 of size 768
  I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303ee6400 of size 73728
  I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303ef8400 of size 1024
  I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303ef8800 of size 86016
  I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303f0d800 of size 768
  I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303f0db00 of size 768
  I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303f0de00 of size 256
  I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303f0df00 of size 256
  I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303f0e000 of size 24832
  ....
  ....
  I tensorflow/core/common_runtime/bfc_allocator.cc:692] 188 Chunks of size 313856 totalling 56.27MiB
  I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 318976 totalling 311.5KiB
  I tensorflow/core/common_runtime/bfc_allocator.cc:692] 146 Chunks of size 397824 totalling 55.39MiB
  I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 10.60GiB
  I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats: 
  Limit:                 11386585088
  InUse:                 11386585088
  MaxInUse:              11386585088
  NumAllocs:               556930762
  MaxAllocSize:             30105600
  W tensorflow/core/common_runtime/bfc_allocator.cc:270]       ****************************************************************************************************
  W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 1.5KiB.  See logs for memory state.
  W tensorflow/core/framework/op_kernel.cc:899] Internal: Dst tensor is not initialized.
  E tensorflow/core/common_runtime/executor.cc:334] Executor failed to create kernel. Internal: Dst tensor is not initialized.
           [[Node: zeros_1931 = Const[dtype=DT_DOUBLE, value=Tensor<type: double shape: [197] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
  54.249; ||W|| 2175.582= lmbd*||W||= 5.291; seconds= 107.76
  final; 2016-09-03 02:47:22; Iter= 10000; lr= 0.000078; l2= 0.002432; str= [43, 106, 200, 116, 1]; Train_loss= 1240.027; Test_loss= 1257.031; best_tets= 1254.249; ||W|| 2232.211= lmbd*||W||= 5.429; seconds= 116.30
  0.95 0.006917335944 0.75 0.00218294805583 0.9 9000 [43, 46, 29, 1] 64
  Traceback (most recent call last):
    File "runner.py", line 66, in <module>
      result += [dnnMultiLayerCoeff(maxiter,display,decay_rate,result[0][0],power,result[0][1],init_momentum,decay_step,result[0][2],result[0][3],batch_size,var,MaxUnImp,run_number,result[0][7],result[0][8])]
    File "/scratch/afo214/tensorflow/dnnMultiLayerCoeff.py", line 130, in dnnMultiLayerCoeff
      sess.run(tf.initialize_all_variables())
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 372, in run
      run_metadata_ptr)
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _run
      feed_dict_string, options, run_metadata)
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 708, in _do_run
      target_list, options, run_metadata)
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 728, in _do_call
      raise type(e)(node_def, op, message)
  tensorflow.python.framework.errors.InternalError: Dst tensor is not initialized.
           [[Node: zeros_1931 = Const[dtype=DT_DOUBLE, value=Tensor<type: double shape: [197] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
  Caused by op u'zeros_1931', defined at:
    File "runner.py", line 64, in <module>
      result += [dnnMultiLayerCoeff(maxiter,display,decay_rate,starter_learning_rate,power,l2lambda,init_momentum,decay_step,NoHiLayr,node[j],batch_size,var,MaxUnImp,run_number,w,b)]
    File "/scratch/afo214/tensorflow/dnnMultiLayerCoeff.py", line 127, in dnnMultiLayerCoeff
      train_step = tf.train.MomentumOptimizer(learning_rate,0.9).minimize(loss, global_step=global_step,aggregation_method=tf.AggregationMethod.EXPERIMENTAL_TREE)
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 195, in minimize
      name=name)
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 297, in apply_gradients
      self._create_slots(var_list)
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/momentum.py", line 51, in _create_slots
      self._zeros_slot(v, "momentum", self._name)
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 491, in _zeros_slot
      named_slots[var] = slot_creator.create_zeros_slot(var, op_name)
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 106, in create_zeros_slot
      val = array_ops.zeros(primary.get_shape().as_list(), dtype=dtype)
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 752, in zeros
      output = constant(0, shape=shape, dtype=dtype, name=name)
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 166, in constant
      attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
      original_op=self._default_original_op, op_def=op_def)
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
      self._traceback = _extract_stack()

Answer 1

I freed the GPU memory in use by deleting the session object before each new network, and it now works correctly.
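
For readers who want to see what that can look like, here is a minimal sketch, not the original poster's code. It assumes each candidate network can be built and trained inside one function; `train_one_structure`, the random data, and the hyperparameters are hypothetical. The node name `zeros_1931` in the traceback suggests that ops from earlier networks were still accumulating in the default graph, so the sketch also gives every network its own `tf.Graph`, which goes a step beyond what the answer describes. It is written against `tf.compat.v1` so it runs on current releases; on TensorFlow 0.9 the same names live directly under `tf`, and the initializer is `tf.initialize_all_variables()`.

```python
import numpy as np
import tensorflow as tf

# Graph/session API. On TensorFlow 0.9 these names live directly under `tf`.
tf1 = tf.compat.v1


def train_one_structure(layer_sizes, data_x, data_y, n_steps=5):
    """Build and train one network in a private graph and session.

    Returns the final training loss as a plain Python float, so the caller
    keeps no reference to any TensorFlow object.
    """
    graph = tf1.Graph()  # fresh graph: nothing is added to the default graph
    with graph.as_default():
        x = tf1.placeholder(tf.float64, [None, layer_sizes[0]])
        y = tf1.placeholder(tf.float64, [None, layer_sizes[-1]])
        h = x
        n_layers = len(layer_sizes) - 1
        for i in range(n_layers):
            w = tf1.Variable(tf1.random_normal(
                [layer_sizes[i], layer_sizes[i + 1]], stddev=0.1, dtype=tf.float64))
            b = tf1.Variable(tf1.zeros([layer_sizes[i + 1]], dtype=tf.float64))
            h = tf1.matmul(h, w) + b
            if i < n_layers - 1:
                h = tf1.nn.relu(h)  # no activation on the output layer
        loss = tf1.reduce_mean(tf1.square(h - y))
        train_step = tf1.train.MomentumOptimizer(0.01, 0.9).minimize(loss)
        init = tf1.global_variables_initializer()

    # Leaving the `with` block closes the session, so the variables and
    # momentum slots of this network are released before the next one starts.
    with tf1.Session(graph=graph) as sess:
        sess.run(init)
        for _ in range(n_steps):
            _, loss_value = sess.run([train_step, loss],
                                     feed_dict={x: data_x, y: data_y})
    return float(loss_value)


# Hypothetical data; the two layer lists are the ones printed in the log above.
rng = np.random.RandomState(0)
data_x = rng.randn(32, 43)
data_y = rng.randn(32, 1)
structures = [[43, 46, 29, 1], [43, 106, 200, 116, 1]]
losses = [train_one_structure(s, data_x, data_y) for s in structures]
```

Only Python floats come back from the function, so nothing outside it keeps a graph or session alive between candidates. If weights have to carry over to a second epoch, one option is to fetch them as NumPy arrays with `sess.run` before the session closes.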

Tensorflow: Ran out of memory trying to allocate 1.5KiB
