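I ran some TensorFlow code on a server.
It always gets stuck at "Creating TensorFlow device" and prints nothing further, and the terminal becomes unresponsive. Other TensorFlow projects run fine, but this one always fails. The log output: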
2017-11-20 23:32:51.701175: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701252: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701280: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701293: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701320: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:52.552691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:04:00.0
Total memory: 7.92GiB
Free memory: 1.32GiB
2017-11-20 23:32:53.059142: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x842f610 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:53.060813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:05:00.0
Total memory: 7.92GiB
Free memory: 7.70GiB
2017-11-20 23:32:53.582481: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x8433370 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:53.584843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 2 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:08:00.0
Total memory: 7.92GiB
Free memory: 1.16GiB
2017-11-20 23:32:54.094126: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x84370d0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:54.095696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 3 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:09:00.0
Total memory: 7.92GiB
Free memory: 1.82GiB
2017-11-20 23:32:54.633158: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x843ae30 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:54.634412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 4 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:84:00.0
Total memory: 7.92GiB
Free memory: 1.80GiB
2017-11-20 23:32:55.226210: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x843eb90 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:55.227841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 5 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:85:00.0
Total memory: 7.92GiB
Free memory: 4.79GiB
2017-11-20 23:32:55.789872: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x84428f0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:55.790904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 6 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:88:00.0
Total memory: 7.92GiB
Free memory: 773.00MiB
2017-11-20 23:32:56.371886: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x8446650 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:56.373006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 7 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:89:00.0
Total memory: 7.92GiB
Free memory: 1001.00MiB
2017-11-20 23:32:56.374795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 4
2017-11-20 23:32:56.374826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 5
2017-11-20 23:32:56.374838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 6
2017-11-20 23:32:56.374850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 7
2017-11-20 23:32:56.375122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 4
2017-11-20 23:32:56.375151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 5
2017-11-20 23:32:56.375178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 6
2017-11-20 23:32:56.375189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 7
2017-11-20 23:32:56.375359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 4
2017-11-20 23:32:56.375374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 5
2017-11-20 23:32:56.375385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 6
2017-11-20 23:32:56.375397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 7
2017-11-20 23:32:56.375465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 4
2017-11-20 23:32:56.375477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 5
2017-11-20 23:32:56.375490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 6
2017-11-20 23:32:56.375501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 7
2017-11-20 23:32:56.375513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 0
2017-11-20 23:32:56.375524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 1
2017-11-20 23:32:56.375536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 2
2017-11-20 23:32:56.375548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 3
2017-11-20 23:32:56.377945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 0
2017-11-20 23:32:56.378028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 1
2017-11-20 23:32:56.378052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 2
2017-11-20 23:32:56.378074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 3
2017-11-20 23:32:56.378504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 0
2017-11-20 23:32:56.378528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 1
2017-11-20 23:32:56.378556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 2
2017-11-20 23:32:56.378591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 3
2017-11-20 23:32:56.378883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 0
2017-11-20 23:32:56.378912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 1
2017-11-20 23:32:56.378936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 2
2017-11-20 23:32:56.378959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 3
2017-11-20 23:32:56.379568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 2 3 4 5 6 7
2017-11-20 23:32:56.379591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y Y Y Y N N N N
2017-11-20 23:32:56.379607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1: Y Y Y Y N N N N
2017-11-20 23:32:56.379621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 2: Y Y Y Y N N N N
2017-11-20 23:32:56.379635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 3: Y Y Y Y N N N N
2017-11-20 23:32:56.379651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 4: N N N N Y Y Y Y
2017-11-20 23:32:56.379664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 5: N N N N Y Y Y Y
2017-11-20 23:32:56.379680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 6: N N N N Y Y Y Y
2017-11-20 23:32:56.379694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 7: N N N N Y Y Y Y
2017-11-20 23:32:56.379724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-11-20 23:32:56.379742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:05:00.0)
2017-11-20 23:32:56.379757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080, pci bus id: 0000:08:00.0)
2017-11-20 23:32:56.379772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080, pci bus id: 0000:09:00.0)
2017-11-20 23:32:56.379786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:4) -> (device: 4, name: GeForce GTX 1080, pci bus id: 0000:84:00.0)
2017-11-20 23:32:56.379800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:5) -> (device: 5, name: GeForce GTX 1080, pci bus id: 0000:85:00.0)
2017-11-20 23:32:56.379816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:6) -> (device: 6, name: GeForce GTX 1080, pci bus id: 0000:88:00.0)
2017-11-20 23:32:56.379829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: GeForce GTX 1080, pci bus id: 0000:89:00.0)
And nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 0000:04:00.0     Off |                  N/A |
| 56%   76C    P2   161W / 180W |   6652MiB /  8114MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 0000:05:00.0     Off |                  N/A |
| 24%   38C    P8    11W / 180W |    115MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 0000:08:00.0     Off |                  N/A |
| 24%   42C    P8    12W / 180W |   6820MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 0000:09:00.0     Off |                  N/A |
| 24%   44C    P8    12W / 180W |   6142MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 1080    On   | 0000:84:00.0     Off |                  N/A |
| 67%   82C    P2   179W / 180W |   6160MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 1080    On   | 0000:85:00.0     Off |                  N/A |
| 48%   71C    P2    81W / 180W |   3094MiB /  8114MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 1080    On   | 0000:88:00.0     Off |                  N/A |
| 25%   58C    P2    51W / 180W |   7230MiB /  8114MiB |     66%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 1080    On   | 0000:89:00.0     Off |                  N/A |
| 26%   59C    P2    52W / 180W |   7002MiB /  8114MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     17811    C   python                                         327MiB |
|    0     21873    C   python                                         254MiB |
|    0     25407    C   ../../caffe/build/tools/caffe                 6065MiB |
|    1     17811    C   python                                         113MiB |
|    2     17811    C   python                                         113MiB |
|    2     22605    C   python                                        6705MiB |
|    3     17811    C   python                                         113MiB |
|    3     22605    C   python                                        6027MiB |
|    4      8984    C   ./build/tools/caffe                           6045MiB |
|    4     17811    C   python                                         113MiB |
|    5     17811    C   python                                         113MiB |
|    5     21873    C   python                                        2977MiB |
|    6     13442    C   python                                        7115MiB |
|    6     17811    C   python                                         113MiB |
|    7     13442    C   python                                        6887MiB |
|    7     17811    C   python                                         113MiB |
+-----------------------------------------------------------------------------+
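Since it does not raise any CUDA memory error, and there is enough memory and it is being used (I can see a little CPU usage), I don't think this is a resource problem. However, I have been stuck on this for more than 20 hours. Can anyone help me? Thanks a lot.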
Answer 0 (score: 0)
I ran into a similar problem when using the TensorFlow Dataset API with tfrecords in TensorFlow 2.x. I have about 24k tfrecords, and my previous data pipeline was:
import tensorflow as tf

# NUMBER_OF_PARALLEL_CALL, BATCH_SIZE, PARSHING, decode_memo_train and
# augmentation_Aggresive are defined elsewhere in the original project.
def load_training_tfrecords(record_mask_file, SHUFFLE):
    dataset = tf.data.Dataset.list_files(record_mask_file).interleave(lambda x: tf.data.TFRecordDataset(x), cycle_length=NUMBER_OF_PARALLEL_CALL, num_parallel_calls=NUMBER_OF_PARALLEL_CALL)
    dataset = dataset.map(decode_memo_train).map(augmentation_Aggresive).repeat().batch(BATCH_SIZE)
    batched_dataset = dataset.prefetch(PARSHING)
    return batched_dataset
Removing the interleave step solved my problem. You can also try dropping the prefetch.
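A minimal sketch of what the pipeline might look like without interleave and prefetch, assuming the files can simply be read one after another. The function name `load_training_tfrecords_no_interleave`, the helper `write_demo_records`, the demo file names, and the `BATCH_SIZE` value are all hypothetical and only there so the example runs on its own. `tf.io.gfile.glob` is swapped in for `list_files` here, and the `decode_memo_train` / `augmentation_Aggresive` map steps are left out because their definitions are not shown above.

```python
import os
import tempfile

import tensorflow as tf

BATCH_SIZE = 4  # hypothetical value, chosen only for this demo


def load_training_tfrecords_no_interleave(record_mask_file, batch_size=BATCH_SIZE):
    """Read all files matching the glob pattern sequentially (no interleave, no prefetch)."""
    # Resolve the glob once on the host; sorted() keeps the file order deterministic.
    filenames = sorted(tf.io.gfile.glob(record_mask_file))
    # TFRecordDataset reads the listed files one after another.
    dataset = tf.data.TFRecordDataset(filenames)
    # The original decode/augmentation map steps would be applied here.
    return dataset.repeat().batch(batch_size)


def write_demo_records(directory, num_files=3, records_per_file=2):
    """Write a few tiny TFRecord files so the sketch can be run end to end."""
    for i in range(num_files):
        path = os.path.join(directory, f"demo-{i}.tfrecord")
        with tf.io.TFRecordWriter(path) as writer:
            for j in range(records_per_file):
                writer.write(f"file{i}-record{j}".encode())


demo_dir = tempfile.mkdtemp()
write_demo_records(demo_dir)
demo_dataset = load_training_tfrecords_no_interleave(os.path.join(demo_dir, "*.tfrecord"))
first_batch = next(iter(demo_dataset)).numpy().tolist()
print(first_batch)
```

Reading the files sequentially gives up the parallel file reads that interleave provides, so throughput may drop. Whether that trade-off is acceptable depends on the workload.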