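I am running TensorFlow in a loop over 300 random structures to find a good network structure. After the first epoch over the data finishes, I drop the worst 10% of the networks and start the second epoch on the remaining ones. However, it fails at around iteration 350. I am running on a Tesla K80 with 11.25 GiB of memory. I am using TensorFlow 0.9.0, with aggregation_method = tf.AggregationMethod.EXPERIMENTAL_TREE set on tf.train.MomentumOptimizer. The error I get is below. (Because the log is very long, I have included only its beginning, the point where the details change, and the final lines.)
I appreciate any help. Afshin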
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8192):
......
......
Bin (268435456): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:656] Bin for 1.8KiB was 1.0KiB, Chunk State:
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303ee0000 of size 24832
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303ee6100 of size 768
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303ee6400 of size 73728
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303ef8400 of size 1024
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303ef8800 of size 86016
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303f0d800 of size 768
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303f0db00 of size 768
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303f0de00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303f0df00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x2303f0e000 of size 24832
....
....
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 188 Chunks of size 313856 totalling 56.27MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 318976 totalling 311.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 146 Chunks of size 397824 totalling 55.39MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 10.60GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats:
Limit: 11386585088
InUse: 11386585088
MaxInUse: 11386585088
NumAllocs: 556930762
MaxAllocSize: 30105600
W tensorflow/core/common_runtime/bfc_allocator.cc:270] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 1.5KiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:899] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:334] Executor failed to create kernel. Internal: Dst tensor is not initialized.
[[Node: zeros_1931 = Const[dtype=DT_DOUBLE, value=Tensor<type: double shape: [197] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
54.249; ||W|| 2175.582= lmbd*||W||= 5.291; seconds= 107.76
final; 2016-09-03 02:47:22; Iter= 10000; lr= 0.000078; l2= 0.002432; str= [43, 106, 200, 116, 1]; Train_loss= 1240.027; Test_loss= 1257.031; best_tets= 1254.249; ||W|| 2232.211= lmbd*||W||= 5.429; seconds= 116.30
0.95 0.006917335944 0.75 0.00218294805583 0.9 9000 [43, 46, 29, 1] 64
Traceback (most recent call last):
File "runner.py", line 66, in <module>
result += [dnnMultiLayerCoeff(maxiter,display,decay_rate,result[0][0],power,result[0][1],init_momentum,decay_step,result[0][2],result[0][3],batch_size,var,MaxUnImp,run_number,result[0][7],result[0][8])]
File "/scratch/afo214/tensorflow/dnnMultiLayerCoeff.py", line 130, in dnnMultiLayerCoeff
sess.run(tf.initialize_all_variables())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 708, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 728, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InternalError: Dst tensor is not initialized.
[[Node: zeros_1931 = Const[dtype=DT_DOUBLE, value=Tensor<type: double shape: [197] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'zeros_1931', defined at:
File "runner.py", line 64, in <module>
result += [dnnMultiLayerCoeff(maxiter,display,decay_rate,starter_learning_rate,power,l2lambda,init_momentum,decay_step,NoHiLayr,node[j],batch_size,var,MaxUnImp,run_number,w,b)]
File "/scratch/afo214/tensorflow/dnnMultiLayerCoeff.py", line 127, in dnnMultiLayerCoeff
train_step = tf.train.MomentumOptimizer(learning_rate,0.9).minimize(loss, global_step=global_step,aggregation_method=tf.AggregationMethod.EXPERIMENTAL_TREE)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 195, in minimize
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 297, in apply_gradients
self._create_slots(var_list)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/momentum.py", line 51, in _create_slots
self._zeros_slot(v, "momentum", self._name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 491, in _zeros_slot
named_slots[var] = slot_creator.create_zeros_slot(var, op_name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 106, in create_zeros_slot
val = array_ops.zeros(primary.get_shape().as_list(), dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 752, in zeros
output = constant(0, shape=shape, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 166, in constant
attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
self._traceback = _extract_stack()
Answer 0 (score: 0)
I freed the used GPU memory by deleting the session object for each new network, and it now works.
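A minimal sketch of one way to do this, not the original code: each network is built in its own tf.Graph and run in its own session, and the session is closed before the next network starts. The op name zeros_1931 in the log suggests that ops were piling up in a single shared graph, so using a fresh graph per network may help as well. The function name train_one, the toy loss, and the tf1 alias are hypothetical and only for illustration. The alias is an assumption that lets the same calls resolve on newer releases; on 0.9 it is simply tf.

```python
import tensorflow as tf

# Assumption: use tf.compat.v1 where it exists (newer releases); otherwise plain tf.
tf1 = getattr(tf.compat, "v1", tf)


def train_one(structure, steps=10):
    """Build and train one toy network in its own graph and session."""
    graph = tf1.Graph()
    with graph.as_default():
        # Hypothetical stand-in for the real model: one weight vector per layer.
        weights = [tf1.Variable(tf1.zeros([n], dtype=tf1.float64))
                   for n in structure]
        loss = tf1.add_n([tf1.reduce_sum(tf1.square(w - 1.0)) for w in weights])
        train_step = tf1.train.MomentumOptimizer(0.1, 0.9).minimize(loss)
        init = tf1.initialize_all_variables()

    # The with-block closes the session on exit, so the variables of this
    # network are not kept alive while the next one is being trained.
    with tf1.Session(graph=graph) as sess:
        sess.run(init)
        for _ in range(steps):
            sess.run(train_step)
        return float(sess.run(loss))


# One graph and one session per candidate structure.
losses = [train_one(s) for s in ([3, 2], [4], [2, 2, 1])]
```

Only plain Python values (here the final loss) leave the function, so nothing holds a reference to the old graph or session once a network has been evaluated.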