在"创建Tensor设备"之后,运行张量流停滞不前

Date: 2017-11-21 13:51:15

Tags: tensorflow tensorflow-gpu

I ran some TensorFlow code on a server like this:

  • 8 × Nvidia GTX 1080 GPUs
  • roughly 40 GB of GPU memory
  • 200 GB of RAM

But the run always stalls at "Creating TensorFlow device", prints nothing further, and the terminal appears dead. Other TF projects run fine on this machine, yet this one always fails.

2017-11-20 23:32:51.701175: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701252: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701280: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701293: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701320: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:52.552691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:04:00.0
Total memory: 7.92GiB
Free memory: 1.32GiB
2017-11-20 23:32:53.059142: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x842f610 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:53.060813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:05:00.0
Total memory: 7.92GiB
Free memory: 7.70GiB
2017-11-20 23:32:53.582481: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x8433370 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:53.584843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 2 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:08:00.0
Total memory: 7.92GiB
Free memory: 1.16GiB
2017-11-20 23:32:54.094126: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x84370d0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:54.095696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 3 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:09:00.0
Total memory: 7.92GiB
Free memory: 1.82GiB
2017-11-20 23:32:54.633158: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x843ae30 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:54.634412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 4 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:84:00.0
Total memory: 7.92GiB
Free memory: 1.80GiB
2017-11-20 23:32:55.226210: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x843eb90 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:55.227841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 5 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:85:00.0
Total memory: 7.92GiB
Free memory: 4.79GiB
2017-11-20 23:32:55.789872: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x84428f0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:55.790904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 6 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:88:00.0
Total memory: 7.92GiB
Free memory: 773.00MiB
2017-11-20 23:32:56.371886: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x8446650 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:56.373006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 7 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:89:00.0
Total memory: 7.92GiB
Free memory: 1001.00MiB
2017-11-20 23:32:56.374795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 4
2017-11-20 23:32:56.374826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 5
2017-11-20 23:32:56.374838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 6
2017-11-20 23:32:56.374850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 7
2017-11-20 23:32:56.375122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 4
2017-11-20 23:32:56.375151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 5
2017-11-20 23:32:56.375178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 6
2017-11-20 23:32:56.375189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 7
2017-11-20 23:32:56.375359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 4
2017-11-20 23:32:56.375374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 5
2017-11-20 23:32:56.375385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 6
2017-11-20 23:32:56.375397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 7
2017-11-20 23:32:56.375465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 4
2017-11-20 23:32:56.375477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 5
2017-11-20 23:32:56.375490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 6
2017-11-20 23:32:56.375501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 7
2017-11-20 23:32:56.375513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 0
2017-11-20 23:32:56.375524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 1
2017-11-20 23:32:56.375536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 2
2017-11-20 23:32:56.375548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 3
2017-11-20 23:32:56.377945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 0
2017-11-20 23:32:56.378028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 1
2017-11-20 23:32:56.378052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 2
2017-11-20 23:32:56.378074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 3
2017-11-20 23:32:56.378504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 0
2017-11-20 23:32:56.378528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 1
2017-11-20 23:32:56.378556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 2
2017-11-20 23:32:56.378591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 3
2017-11-20 23:32:56.378883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 0
2017-11-20 23:32:56.378912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 1
2017-11-20 23:32:56.378936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 2
2017-11-20 23:32:56.378959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 3
2017-11-20 23:32:56.379568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 2 3 4 5 6 7
2017-11-20 23:32:56.379591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y Y Y Y N N N N
2017-11-20 23:32:56.379607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1:   Y Y Y Y N N N N
2017-11-20 23:32:56.379621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 2:   Y Y Y Y N N N N
2017-11-20 23:32:56.379635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 3:   Y Y Y Y N N N N
2017-11-20 23:32:56.379651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 4:   N N N N Y Y Y Y
2017-11-20 23:32:56.379664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 5:   N N N N Y Y Y Y
2017-11-20 23:32:56.379680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 6:   N N N N Y Y Y Y
2017-11-20 23:32:56.379694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 7:   N N N N Y Y Y Y
2017-11-20 23:32:56.379724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-11-20 23:32:56.379742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:05:00.0)
2017-11-20 23:32:56.379757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080, pci bus id: 0000:08:00.0)
2017-11-20 23:32:56.379772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080, pci bus id: 0000:09:00.0)
2017-11-20 23:32:56.379786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:4) -> (device: 4, name: GeForce GTX 1080, pci bus id: 0000:84:00.0)
2017-11-20 23:32:56.379800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:5) -> (device: 5, name: GeForce GTX 1080, pci bus id: 0000:85:00.0)
2017-11-20 23:32:56.379816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:6) -> (device: 6, name: GeForce GTX 1080, pci bus id: 0000:88:00.0)
2017-11-20 23:32:56.379829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: GeForce GTX 1080, pci bus id: 0000:89:00.0)

nvidia-smi shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 0000:04:00.0     Off |                  N/A |
| 56%   76C    P2   161W / 180W |   6652MiB /  8114MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 0000:05:00.0     Off |                  N/A |
| 24%   38C    P8    11W / 180W |    115MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 0000:08:00.0     Off |                  N/A |
| 24%   42C    P8    12W / 180W |   6820MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 0000:09:00.0     Off |                  N/A |
| 24%   44C    P8    12W / 180W |   6142MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 1080    On   | 0000:84:00.0     Off |                  N/A |
| 67%   82C    P2   179W / 180W |   6160MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 1080    On   | 0000:85:00.0     Off |                  N/A |
| 48%   71C    P2    81W / 180W |   3094MiB /  8114MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 1080    On   | 0000:88:00.0     Off |                  N/A |
| 25%   58C    P2    51W / 180W |   7230MiB /  8114MiB |     66%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 1080    On   | 0000:89:00.0     Off |                  N/A |
| 26%   59C    P2    52W / 180W |   7002MiB /  8114MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     17811    C   python                                         327MiB |
|    0     21873    C   python                                         254MiB |
|    0     25407    C   ../../caffe/build/tools/caffe                 6065MiB |
|    1     17811    C   python                                         113MiB |
|    2     17811    C   python                                         113MiB |
|    2     22605    C   python                                        6705MiB |
|    3     17811    C   python                                         113MiB |
|    3     22605    C   python                                        6027MiB |
|    4      8984    C   ./build/tools/caffe                           6045MiB |
|    4     17811    C   python                                         113MiB |
|    5     17811    C   python                                         113MiB |
|    5     21873    C   python                                        2977MiB |
|    6     13442    C   python                                        7115MiB |
|    6     17811    C   python                                         113MiB |
|    7     13442    C   python                                        6887MiB |
|    7     17811    C   python                                         113MiB |
+-----------------------------------------------------------------------------+

Since it doesn't raise any CUDA out-of-memory error, and there is enough memory which is actually being used (I can see a little CPU usage), I don't think it is a resource problem. ... But I have been stuck on this for more than 20 hours ... Can anyone help me? Many thanks.
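
For reference, a quick way to test whether contention with the other jobs on this box is the culprit would be to pin the run to a single, mostly idle GPU and let TensorFlow allocate memory on demand. This is only a sketch; the GPU index below is just a placeholder:

import os
# Hypothetical: expose only GPU 1, which nvidia-smi shows as nearly idle.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf

config = tf.ConfigProto()
# Allocate GPU memory on demand instead of grabbing all free memory up front.
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)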

1 Answer:

Answer 0 (score: 0):

I ran into a similar problem when using the TensorFlow 2.x Dataset API with tfrecords. I had about 24k tfrecords, and my previous data pipeline was:

def load_training_tfrecords(record_mask_file, SHUFFLE):
    # Interleave parallel reads from the TFRecord files matching record_mask_file.
    dataset = tf.data.Dataset.list_files(record_mask_file).interleave(
        lambda x: tf.data.TFRecordDataset(x),
        cycle_length=NUMBER_OF_PARALLEL_CALL,
        num_parallel_calls=NUMBER_OF_PARALLEL_CALL)
    # Decode, augment, repeat indefinitely and batch, then prefetch.
    dataset = dataset.map(decode_memo_train).map(augmentation_Aggresive).repeat().batch(BATCH_SIZE)
    batched_dataset = dataset.prefetch(PARSHING)
    return batched_dataset

I removed the interleave, and that solved my problem. You can also drop the prefetch.
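
A minimal sketch of the pipeline with the interleave removed (decode_memo_train, augmentation_Aggresive and the constants are carried over from the snippet above and assumed to be defined elsewhere) could look like this:

import tensorflow as tf

def load_training_tfrecords(record_mask_file, SHUFFLE):
    # Read the matching TFRecord files directly, without interleave.
    files = tf.data.Dataset.list_files(record_mask_file, shuffle=SHUFFLE)
    dataset = tf.data.TFRecordDataset(files, num_parallel_reads=NUMBER_OF_PARALLEL_CALL)
    dataset = dataset.map(decode_memo_train).map(augmentation_Aggresive).repeat().batch(BATCH_SIZE)
    # prefetch() is also omitted here, per the note above.
    return dataset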