Keras: GPU out of memory

Time: 2018-11-13 04:29:05

Tags: amazon-web-services tensorflow keras gpu

I'll just post the details here, but basically my implementation runs out of GPU memory.

The related GitHub issue is here: https://github.com/keras-team/keras/issues/11624

This is the error message:

UserWarning: Viewer requires Qt
  warn('Viewer requires Qt')
2018-11-12 09:30:54.179843: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-12 09:31:11.234972: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-12 09:31:11.236072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:17.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-12 09:31:11.322354: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-12 09:31:11.323475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:18.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-12 09:31:11.413172: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-12 09:31:11.414297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:19.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-12 09:31:11.510326: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-12 09:31:11.511434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1a.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-12 09:31:11.617084: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-12 09:31:11.618204: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 4 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1b.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-12 09:31:11.719956: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-12 09:31:11.721063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 5 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1c.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-12 09:31:11.825226: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-12 09:31:11.826376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 6 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1d.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-12 09:31:11.935858: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-12 09:31:11.936963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 7 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-12 09:31:11.945353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2018-11-12 09:31:14.423061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-12 09:31:14.423126: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2 3 4 5 6 7
2018-11-12 09:31:14.423139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y Y Y Y Y Y Y
2018-11-12 09:31:14.423147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N Y Y Y Y Y Y
2018-11-12 09:31:14.423155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   Y Y N Y Y Y Y Y
2018-11-12 09:31:14.423162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3:   Y Y Y N Y Y Y Y
2018-11-12 09:31:14.423169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 4:   Y Y Y Y N Y Y Y
2018-11-12 09:31:14.423177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 5:   Y Y Y Y Y N Y Y
2018-11-12 09:31:14.423186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 6:   Y Y Y Y Y Y N Y
2018-11-12 09:31:14.423196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 7:   Y Y Y Y Y Y Y N
2018-11-12 09:31:14.425010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:17.0, compute capability: 3.7)
2018-11-12 09:31:14.425736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10757 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:00:18.0, compute capability: 3.7)
2018-11-12 09:31:14.426309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10757 MB memory) -> physical GPU (device: 2, name: Tesla K80, pci bus id: 0000:00:19.0, compute capability: 3.7)
2018-11-12 09:31:14.426869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10757 MB memory) -> physical GPU (device: 3, name: Tesla K80, pci bus id: 0000:00:1a.0, compute capability: 3.7)
2018-11-12 09:31:14.427875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10757 MB memory) -> physical GPU (device: 4, name: Tesla K80, pci bus id: 0000:00:1b.0, compute capability: 3.7)
2018-11-12 09:31:14.428440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 10757 MB memory) -> physical GPU (device: 5, name: Tesla K80, pci bus id: 0000:00:1c.0, compute capability: 3.7)
2018-11-12 09:31:14.428998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:6 with 10757 MB memory) -> physical GPU (device: 6, name: Tesla K80, pci bus id: 0000:00:1d.0, compute capability: 3.7)
2018-11-12 09:31:14.429564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 10757 MB memory) -> physical GPU (device: 7, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
/app/networks.py:240: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(units=3, activation="linear")`
  model.add(Dense(output_dim=action_size, activation='linear'))
2018-11-12 09:31:29.037056: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.09GiB.  Current allocation summary follows.
2018-11-12 09:31:29.037156: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256):   Total Chunks: 36, Chunks in use: 36. 9.0KiB allocated for chunks. 9.0KiB in use in bin. 1.8KiB client-requested in use in bin.
2018-11-12 09:31:29.037186: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037207: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1024):  Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2018-11-12 09:31:29.037224: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2048):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037248: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4096):  Total Chunks: 2, Chunks in use: 2. 12.0KiB allocated for chunks. 12.0KiB in use in bin. 12.0KiB client-requested in use in bin.
2018-11-12 09:31:29.037272: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8192):  Total Chunks: 5, Chunks in use: 5. 40.0KiB allocated for chunks. 40.0KiB in use in bin. 40.0KiB client-requested in use in bin.
2018-11-12 09:31:29.037289: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16384):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037317: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (32768):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037336: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (65536):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037358: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (131072):        Total Chunks: 4, Chunks in use: 4. 544.0KiB allocated for chunks. 544.0KiB in use in bin. 544.0KiB client-requested in use in bin.
2018-11-12 09:31:29.037439: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (262144):        Total Chunks: 1, Chunks in use: 0. 417.8KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037449: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (524288):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037457: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1048576):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037465: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2097152):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037474: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4194304):       Total Chunks: 4, Chunks in use: 4. 16.00MiB allocated for chunks. 16.00MiB in use in bin. 16.00MiB client-requested in use in bin.
2018-11-12 09:31:29.037484: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8388608):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037492: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16777216):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037501: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (33554432):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037509: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (67108864):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037520: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-11-12 09:31:29.037530: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (268435456):     Total Chunks: 4, Chunks in use: 2. 8.00GiB allocated for chunks. 6.18GiB in use in bin. 6.18GiB client-requested in use in bin.
2018-11-12 09:31:29.037540: I tensorflow/core/common_runtime/bfc_allocator.cc:613] Bin for 3.09GiB was 256.00MiB, Chunk State:
2018-11-12 09:31:29.037553: I tensorflow/core/common_runtime/bfc_allocator.cc:619]   Size: 934.00MiB | Requested Size: 12B | in_use: 0, prev:   Size: 3.09GiB | Requested Size: 3.09GiB | in_use: 1
2018-11-12 09:31:29.037564: I tensorflow/core/common_runtime/bfc_allocator.cc:619]   Size: 934.00MiB | Requested Size: 0B | in_use: 0, prev:   Size: 3.09GiB | Requested Size: 3.09GiB | in_use: 1
2018-11-12 09:31:29.037574: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe00000 of size 1280
2018-11-12 09:31:29.037584: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe00500 of size 256
2018-11-12 09:31:29.037591: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe00600 of size 256
2018-11-12 09:31:29.037599: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe00700 of size 256
2018-11-12 09:31:29.037606: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe00800 of size 256
2018-11-12 09:31:29.037619: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe00900 of size 256
2018-11-12 09:31:29.037627: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe00a00 of size 256
2018-11-12 09:31:29.037633: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe00b00 of size 256
2018-11-12 09:31:29.037640: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe00c00 of size 256
2018-11-12 09:31:29.037647: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe00d00 of size 8192
2018-11-12 09:31:29.037655: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe02d00 of size 256
2018-11-12 09:31:29.037662: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe02e00 of size 256
2018-11-12 09:31:29.037668: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe02f00 of size 256
2018-11-12 09:31:29.037677: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03000 of size 256
2018-11-12 09:31:29.037683: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03100 of size 256
2018-11-12 09:31:29.037690: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03200 of size 256
2018-11-12 09:31:29.037702: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03300 of size 256
2018-11-12 09:31:29.037709: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03400 of size 256
2018-11-12 09:31:29.037716: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03500 of size 256
2018-11-12 09:31:29.037722: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03600 of size 256
2018-11-12 09:31:29.037731: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03700 of size 256
2018-11-12 09:31:29.037738: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03800 of size 256
2018-11-12 09:31:29.037745: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03900 of size 256
2018-11-12 09:31:29.037754: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03a00 of size 256
2018-11-12 09:31:29.037761: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03b00 of size 256
2018-11-12 09:31:29.037767: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03c00 of size 256
2018-11-12 09:31:29.037774: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03d00 of size 256
2018-11-12 09:31:29.037784: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03e00 of size 256
2018-11-12 09:31:29.037791: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe03f00 of size 256
2018-11-12 09:31:29.037797: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe04000 of size 256
2018-11-12 09:31:29.037804: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe04100 of size 256
2018-11-12 09:31:29.037812: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe04200 of size 256
2018-11-12 09:31:29.037819: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe04300 of size 6144
2018-11-12 09:31:29.037828: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe05b00 of size 6144
2018-11-12 09:31:29.037835: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe07300 of size 8192
2018-11-12 09:31:29.037841: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe09300 of size 8192
2018-11-12 09:31:29.037848: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe0b300 of size 256
2018-11-12 09:31:29.037857: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe0b400 of size 256
2018-11-12 09:31:29.037864: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe0b500 of size 8192
2018-11-12 09:31:29.037870: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe0d500 of size 8192
2018-11-12 09:31:29.037880: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe0f500 of size 131072
2018-11-12 09:31:29.037886: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe2f500 of size 131072
2018-11-12 09:31:29.037893: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe4f500 of size 256
2018-11-12 09:31:29.037903: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe4f600 of size 256
2018-11-12 09:31:29.037910: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe4f700 of size 256
2018-11-12 09:31:29.037916: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe4f800 of size 256
2018-11-12 09:31:29.037929: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe4f900 of size 147456
2018-11-12 09:31:29.037936: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1fe73900 of size 147456
2018-11-12 09:31:29.037943: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free  at 0x7c1fe97900 of size 427776
2018-11-12 09:31:29.037952: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c1ff00000 of size 4194304
2018-11-12 09:31:29.037958: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c20300000 of size 4194304
2018-11-12 09:31:29.037967: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c20700000 of size 4194304
2018-11-12 09:31:29.037974: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c20b00000 of size 4194304
2018-11-12 09:31:29.037980: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7c20f00000 of size 3315597312
2018-11-12 09:31:29.037989: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free  at 0x7ce6900000 of size 979369984
2018-11-12 09:31:29.037996: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7d20f00000 of size 3315597312
2018-11-12 09:31:29.038003: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free  at 0x7de6900000 of size 979369984
2018-11-12 09:31:29.038011: I tensorflow/core/common_runtime/bfc_allocator.cc:638]      Summary of in-use Chunks by size:
2018-11-12 09:31:29.038020: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 36 Chunks of size 256 totalling 9.0KiB
2018-11-12 09:31:29.038030: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 1280 totalling 1.2KiB
2018-11-12 09:31:29.038038: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 6144 totalling 12.0KiB
2018-11-12 09:31:29.038046: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 8192 totalling 40.0KiB
2018-11-12 09:31:29.038053: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 131072 totalling 256.0KiB
2018-11-12 09:31:29.038063: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 147456 totalling 288.0KiB
2018-11-12 09:31:29.038071: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 4194304 totalling 16.00MiB
2018-11-12 09:31:29.038079: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 3315597312 totalling 6.18GiB
2018-11-12 09:31:29.038088: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 6.19GiB
2018-11-12 09:31:29.038098: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:                 11279748301
InUse:                  6648592640
MaxInUse:               6653229824
NumAllocs:                      64
MaxAllocSize:           3315597312

2018-11-12 09:31:29.038112: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ***************************************___________***************************************___________
Traceback (most recent call last):
  File "./ddrqn_per.py", line 355, in <module>
State Size (4, 216, 43, 1)
    agent.update_target_model()
  File "./ddrqn_per.py", line 153, in update_target_model
    self.target_model.set_weights(self.model.get_weights())
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 508, in set_weights
    K.batch_set_value(tuples)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2470, in batch_set_value
    get_session().run(assign_ops, feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
         [[{{node _arg_Placeholder_6_0_7/_101}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_50__arg_Placeholder_6_0_7", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
         [[{{node Assign_6/_123}} = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_71_Assign_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

This one has me stumped. Does GPU memory not actually pool across devices? I assumed that more GPUs meant more usable memory.

The script is running in a Docker container based on FROM tensorflow/tensorflow:latest-gpu.

My Docker.gpu file looks like this:

FROM tensorflow/tensorflow:latest-gpu

ADD ./gloob-bot-model /app
RUN mkdir /app/images && mkdir /app/models
ADD ./recordings.zip /app/recordings.zip

RUN apt-get update -qq &&\
    apt-get install --no-install-recommends -y \
    python-tk unzip

RUN pip install -r /app/requirements.gpu.txt

ENV PYTHONPATH "${PYTHONPATH}:/app"

CMD cd /app && unzip recordings.zip && python ./ddrqn_per.py

I have already reduced the batch size to 1! I am using an AWS p2.8xlarge, so there are 8 NVIDIA K80 GPUs with 12 GB each.

My input size is:

batch_size x 4 x 216 x 40 x 1

I am generating the images with Pillow, then converting them to grayscale:

image = image.convert('L')

Then I feed arrays of 4 of them into the model.
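For reference, a minimal NumPy-only sketch of that stacking step (the frame dimensions are taken from the input size stated above; the grayscale formula approximates Pillow's 'L' mode, which uses the ITU-R 601-2 luma transform):

```python
import numpy as np

# Dimensions assumed from the input size above: batch_size x 4 x 216 x 40 x 1.
HEIGHT, WIDTH, TRACE_LEN = 216, 40, 4

def to_grayscale(rgb):
    """Approximate Pillow's image.convert('L') (ITU-R 601-2 luma transform)."""
    return rgb[..., 0] * 299 / 1000 + rgb[..., 1] * 587 / 1000 + rgb[..., 2] * 114 / 1000

def build_state(rgb_frames):
    """Stack TRACE_LEN grayscale frames into the model's input shape."""
    gray = [to_grayscale(f)[..., np.newaxis] for f in rgb_frames]  # each (216, 40, 1)
    state = np.stack(gray, axis=0)     # (4, 216, 40, 1)
    return state[np.newaxis, ...]      # prepend batch dim -> (1, 4, 216, 40, 1)

frames = [np.zeros((HEIGHT, WIDTH, 3)) for _ in range(TRACE_LEN)]
state = build_state(frames)
print(state.shape)  # (1, 4, 216, 40, 1)
```

This is only an illustration of the shapes involved; the actual frame capture code is not shown in the question.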

This is my model:

model = Sequential()
model.add(TimeDistributed(Conv2D(32, 8, activation='relu'), input_shape=input_shape))
model.add(TimeDistributed(Conv2D(64, 4, activation='relu')))
model.add(TimeDistributed(Conv2D(64, 3, activation='relu')))
model.add(TimeDistributed(Flatten()))

# Use all traces for training
#model.add(LSTM(512, return_sequences=True, activation='tanh'))
#model.add(TimeDistributed(Dense(output_dim=action_size, activation='linear')))

# Use last trace for training
model.add(LSTM(512, activation='tanh'))
model.add(Dense(output_dim=action_size, activation='linear'))

adam = Adam(lr=learning_rate)
model.compile(loss=huber_loss, optimizer=adam)

This is my user data script that runs at initialization; it installs the NVIDIA packages on the AWS Ubuntu machine. (Not sure whether I should have to do this, but it doesn't work without it.)

https://gist.github.com/kevupton/c963cd237ed8ad24b1140694fe867db2

My config settings:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
K.set_session(sess)
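As a point of comparison (not part of the original post), TF 1.x's `ConfigProto` can alternatively cap the fraction of each GPU's memory the process is allowed to allocate, instead of growing on demand; a sketch using the same API:

```python
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
# Instead of allow_growth, reserve at most ~90% of each GPU's memory up front.
config.gpu_options.per_process_gpu_memory_fraction = 0.9
K.set_session(tf.Session(config=config))
```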

1 Answer:

Answer 0 (score: 0):

Okay, the fix came down to this:

with tf.device('/cpu:0'):
    model = generate_model()

This code eliminated the OOM error.

Following their documentation: https://keras.io/utils/#multi_gpu_model

from keras.utils import multi_gpu_model

# Instantiate the base model (or "template" model).
# We recommend doing this under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

I implemented it like this:

def generate_model():
    model = Sequential()
    model.add(TimeDistributed(Conv2D(32, 8, activation='relu'), input_shape=input_shape))
    model.add(TimeDistributed(Conv2D(64, 4, activation='relu')))
    model.add(TimeDistributed(Conv2D(64, 3, activation='relu')))
    model.add(TimeDistributed(Flatten()))

    # Use all traces for training
    #model.add(LSTM(512, return_sequences=True, activation='tanh'))
    #model.add(TimeDistributed(Dense(output_dim=action_size, activation='linear')))

    # Use last trace for training
    model.add(LSTM(512, activation='tanh'))
    model.add(Dense(output_dim=action_size, activation='linear'))

    return model

if total_gpus is None:
    model = generate_model()
else:
    with tf.device('/cpu:0'):
        model = generate_model()

    model = multi_gpu_model(model, gpus=total_gpus)

adam = Adam(lr=learning_rate)
model.compile(loss=huber_loss, optimizer=adam)

EDIT! The only problem with this approach is that it is now slower than running on my CPU.