On an AWS g2.2xlarge instance

Time: 2018-01-25 06:08:23

Tags: amazon-web-services memory tensorflow amazon-ec2

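I am running a convolutional neural network on an AWS g2.2xlarge instance. The model runs with 30000 images of size 64x64. However, when I try to run it with images of size 128x128, it fails with a memory error (see below), even when I feed in only 1 image (which has 2 channels: real and imaginary).
Because the error mentions a tensor of shape [32768,16384], I assume it happens in the first (fully connected) layer, which takes an input image with two channels (128 * 128 * 2 = 32768) and outputs a vector of 128 * 128 = 16384 elements. I found recommendations to reduce the batch size, but I am already using only 1 input image. Here it is said that with cudnn one can reach 700-900px on the same AWS instance I am using (although I don't know whether they use fully connected layers). I tried two different AMIs (1, 2), both with cudnn installed, but still got the memory error.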

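My questions are:
1. How do I calculate how much memory a [32768,16384] tensor needs? I am not a computer scientist, so I would appreciate a detailed reply.
2. I am trying to understand whether the instance I am using really has too little memory for my data (g2.2xlarge has 15 GiB), or whether I am just doing something wrong.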

Error:

2018-01-24 16:36:53.666427: I 
tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports 
instructions that this TensorFlow binary was not compiled to use: SSE4.1 
SSE4.2 AVX
2018-01-24 16:36:55.069050: I 
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node 
read from SysFS had negative value (-1), but there must be at least one NUMA 
node, so returning NUMA node zero
2018-01-24 16:36:55.069287: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1062] Found device 0 with 
properties: 
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-01-24 16:36:55.069316: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1152] Creating TensorFlow 
device (/device:GPU:0) -> (device: 0, name: GRID K520, pci bus id: 
0000:00:03.0, compute capability: 3.0)
2018-01-24 16:37:59.766001: W 
tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran 
out of memory trying to allocate 2.00GiB.  Current allocation summary follows.
2018-01-24 16:37:59.766054: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (256):     Total 
Chunks: 10, Chunks in use: 10. 2.5KiB allocated for chunks. 2.5KiB in use in 
bin. 40B client-requested in use in bin.
2018-01-24 16:37:59.766070: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (512):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766084: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (1024):    Total 
Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in 
bin. 1.0KiB client-requested in use in bin.
2018-01-24 16:37:59.766094: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (2048):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766108: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (4096):    Total 
Chunks: 2, Chunks in use: 2. 12.5KiB allocated for chunks. 12.5KiB in use in 
bin. 12.5KiB client-requested in use in bin.
2018-01-24 16:37:59.766122: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (8192):    Total 
Chunks: 2, Chunks in use: 2. 24.5KiB allocated for chunks. 24.5KiB in use in 
bin. 24.5KiB client-requested in use in bin.
2018-01-24 16:37:59.766134: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (16384):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766143: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (32768):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766155: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (65536):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766163: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (131072):  Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766177: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (262144):  Total 
Chunks: 2, Chunks in use: 2. 800.0KiB allocated for chunks. 800.0KiB in use in 
bin. 800.0KiB client-requested in use in bin.
2018-01-24 16:37:59.766196: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (524288):  Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766208: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (1048576):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766221: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (2097152):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766230: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (4194304):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766241: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (8388608):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766250: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (16777216):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766262: I         
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (33554432):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766271: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (67108864):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766282: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (134217728):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766292: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (268435456):   Total 
Chunks: 2, Chunks in use: 1. 3.57GiB allocated for chunks. 2.00GiB in use in 
bin. 2.00GiB client-requested in use in bin.
2018-01-24 16:37:59.766304: I 
tensorflow/core/common_runtime/bfc_allocator.cc:644] Bin for 2.00GiB was 
256.00MiB, Chunk State: 
2018-01-24 16:37:59.766335: I 
tensorflow/core/common_runtime/bfc_allocator.cc:650]   Size: 1.57GiB | 
Requested Size: 0B | in_use: 0, prev:   Size: 2.00GiB | Requested Size: 
2.00GiB | in_use: 1
2018-01-24 16:37:59.766358: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680000 of 
size 1280
2018-01-24 16:37:59.766374: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680500 of 
size 256
2018-01-24 16:37:59.766381: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680600 of 
size 256
2018-01-24 16:37:59.766387: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680700 of 
size 256
2018-01-24 16:37:59.766397: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680800 of 
size 256
2018-01-24 16:37:59.766402: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680900 of 
size 256
2018-01-24 16:37:59.766412: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680a00 of 
size 256
2018-01-24 16:37:59.766422: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680b00 of 
size 256
2018-01-24 16:37:59.766429: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680c00 of 
size 256
2018-01-24 16:37:59.766435: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680d00 of 
size 256
2018-01-24 16:37:59.766459: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680e00 of 
size 256
2018-01-24 16:37:59.766471: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680f00 of 
size 6400
2018-01-24 16:37:59.766477: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702682800 of 
size 6400
2018-01-24 16:37:59.766482: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702684100 of 
size 409600
2018-01-24 16:37:59.766492: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x7026e8100 of 
size 409600
2018-01-24 16:37:59.766499: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x70274c100 of 
size 12544
2018-01-24 16:37:59.766509: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x70274f200 of 
size 12544
2018-01-24 16:37:59.766517: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702752300 of 
size 2147483648
2018-01-24 16:37:59.766523: I 
tensorflow/core/common_runtime/bfc_allocator.cc:671] Free at 0x782752300 of 
size 1684724992
2018-01-24 16:37:59.766530: I 
tensorflow/core/common_runtime/bfc_allocator.cc:677]      Summary of in-use 
Chunks by size: 
2018-01-24 16:37:59.766543: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 10 Chunks of size 256 
totalling 2.5KiB
2018-01-24 16:37:59.766557: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 1 Chunks of size 1280 
totalling 1.2KiB
2018-01-24 16:37:59.766569: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 6400 
totalling 12.5KiB
2018-01-24 16:37:59.766577: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 12544 
totalling 24.5KiB
2018-01-24 16:37:59.766585: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 409600 
totalling 800.0KiB
2018-01-24 16:37:59.766596: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 1 Chunks of size 
2147483648 totalling 2.00GiB
2018-01-24 16:37:59.766606: I 
tensorflow/core/common_runtime/bfc_allocator.cc:684] Sum Total of in-use 
chunks: 2.00GiB
2018-01-24 16:37:59.766620: I 
tensorflow/core/common_runtime/bfc_allocator.cc:686] Stats: 
Limit:                  3833069568
InUse:                  2148344576
MaxInUse:               2148344576
NumAllocs:                      18
MaxAllocSize:           2147483648

2018-01-24 16:37:59.766635: W 
tensorflow/core/common_runtime/bfc_allocator.cc:277] 

2018-01-24 16:37:59.766660: W tensorflow/core/framework/op_kernel.cc:1188] 
Resource exhausted: OOM when allocating tensor of shape [32768,16384] and type 
float
2018-01-24 16:38:00.828932: E tensorflow/core/common_runtime/executor.cc:651] 
Executor failed to create kernel. Resource exhausted: OOM when allocating 
tensor of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: 
[32768,16384] values: [0 0 0]...>, 
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Traceback (most recent call last):
File "myAutomap.py", line 278, in <module>
print_cost=True)
File "myAutomap.py", line 240, in model
sess.run(init)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 889, in run
run_metadata_ptr)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 1317, in _do_run
options, run_metadata)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when 
allocating tensor of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: 
[32768,16384] values: [0 0 0]...>, 
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op u'fc1/weights/RMSProp_1/Initializer/zeros', defined at:
File "myAutomap.py", line 278, in <module>
print_cost=True)
File "myAutomap.py", line 228, in model
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(cost)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 365, in minimize
name=name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 516, in 
apply_gradients
self._create_slots([_get_variable_for(v) for v in var_list])
File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/rmsprop.py", 
line 113, in _create_slots
self._zeros_slot(v, "momentum", self._name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 882, in _zeros_slot
named_slots[_var_key(var)] = slot_creator.create_zeros_slot(var, op_name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 174, in 
create_zeros_slot
colocate_with_primary=colocate_with_primary)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 148, in 
create_slot_with_initializer
dtype)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 67, in 
_create_slot_var
validate_shape=validate_shape)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 1256, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 1097, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 435, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 404, in _true_getter
use_resource=use_resource, constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 806, in 
_get_single_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", 
line 229, in __init__
constraint=constraint)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", 
line 323, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 780, in <lambda>
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", 
line 93, in __call__
return array_ops.zeros(shape, dtype)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", 
line 1509, in zeros
output = constant(zero, shape=shape, dtype=dtype, name=name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/framework/constant_op.py", line 218, in constant
name=name).outputs[0]
File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", 
line 3069, in create_op
op_def=op_def)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", 
line 1579, in __init__
self._traceback = self._graph._extract_stack()  # pylint: disable=protected-
access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor 
of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: 
[32768,16384] values: [0 0 0]...>, 
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Errore di segmentazione

1 answer:

Answer 0 (score: 3)

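The amount of memory you need depends largely on the size of the Tensor, but also on the data type you use (int32, int64, float16, float32, float64).

So, for question 1: your Tensor needs 32768 x 16384 x memory_size_of_your_datatype bytes of memory. For example, the memory footprint of a float64 is, as the name suggests, 64 bits, which is 8 bytes, so in that case your Tensor would need 4.3e9 bytes, or 4.3 gigabytes.

So, if the loss of precision does not hurt your accuracy too much, an easy way to reduce memory consumption is to go from float64 to float32 or even float16 (1/2 and 1/4 of the memory, respectively).

In addition, you have to understand how the total memory of your AWS instance is composed, i.e. how much GPU RAM the GPU in the instance has, because that is the critical memory here.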
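The arithmetic above can be sketched in a few lines of plain Python. The shape and the GPU limit are taken from the error log in the question; the log reports the tensor as type float (float32), so the 4-byte row is the one that matches the 2.00GiB allocation that failed. The count of three same-shaped copies (the weights plus two RMSProp slot tensors) is an assumption based on the `momentum` slot visible in the traceback, not something the log states directly, and `tensor_bytes` is a hypothetical helper name.

```python
# Bytes per element for the data types mentioned in the answer.
BYTES_PER_ELEMENT = {
    "float16": 2,
    "float32": 4,
    "float64": 8,
    "int32": 4,
    "int64": 8,
}


def tensor_bytes(shape, dtype):
    """Return the number of bytes a dense tensor of this shape and dtype occupies."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_ELEMENT[dtype]


SHAPE = (32768, 16384)        # fc1 weight shape from the error message
GPU_LIMIT_BYTES = 3833069568  # "Limit" value from the allocator stats in the log

for dtype in ("float64", "float32", "float16"):
    size = tensor_bytes(SHAPE, dtype)
    print("%s: %d bytes = %.2f GiB" % (dtype, size, size / 2.0 ** 30))

# Assumption: training holds the weights plus two RMSProp slot tensors
# of the same shape, i.e. three float32 copies in total.
COPIES = 3
needed = COPIES * tensor_bytes(SHAPE, "float32")
print("needed: %.2f GiB, GPU limit: %.2f GiB"
      % (needed / 2.0 ** 30, GPU_LIMIT_BYTES / 2.0 ** 30))
```

Under that assumption, a single float32 copy fits on the GPU but the copies together do not, which is consistent with the log showing one 2.00GiB chunk in use when the next 2.00GiB request fails.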

Also, take a look at https://www.tensorflow.org/api_docs/python/tf/profiler/Profiler

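Edit: You can pass a tf.ConfigProto() to your tf.Session(config=...), through which you can specify how the GPU is used.

In particular, take a look at the allow_growth, allow_soft_placement and per_process_gpu_memory_fraction options (the last one especially should help you).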
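A minimal sketch of such a configuration, assuming the TensorFlow 1.x graph-mode API that appears in the traceback. The value 0.9 is an arbitrary illustration, not a recommendation, and the `tf1` alias is only there so the same snippet also works on TensorFlow 2.x, where these names live under `tf.compat.v1`.

```python
import tensorflow as tf

# TF 1.x exposes ConfigProto/GPUOptions/Session at the top level;
# newer releases keep them under tf.compat.v1. Fall back accordingly.
tf1 = getattr(tf.compat, "v1", tf)

gpu_options = tf1.GPUOptions(
    # Claim GPU memory on demand instead of reserving almost all of it up front.
    allow_growth=True,
    # Upper bound on the fraction of GPU memory this process may use
    # (0.9 is an illustrative value).
    per_process_gpu_memory_fraction=0.9,
)

config = tf1.ConfigProto(
    gpu_options=gpu_options,
    # Let TensorFlow place an op on another device when the requested
    # device has no kernel for it.
    allow_soft_placement=True,
)

with tf1.Session(config=config) as sess:
    # Build the graph before this point, then run the initializer
    # and the training ops here.
    pass
```

These options only control how TensorFlow claims memory on the GPU. They do not increase the physical GPU memory, so a set of tensors that is larger than the card's RAM will still not fit.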