Segmentation fault when using TensorFlow

Time: 2018-09-07 07:17:42

Tags: tensorflow keras gpu

My main program, which uses Keras and TensorFlow, crashes with a segmentation fault:

python fine-tune.py --train_dir /root/data/train_dump/ --val_dir /root/data/val_dump/ --nb_epoch 100 --batch_size 32
Using TensorFlow backend.
WARNING:root:Keras version 2.2.2 detected. Last version known to be fully compatible of Keras is 2.1.3 .
WARNING:root:TensorFlow version 1.10.0 detected. Last version known to be fully compatible is 1.5.0 .
Found 2976 images belonging to 4 classes.
Found 1100 images belonging to 4 classes.
2018-09-06 23:28:06.891710: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-09-06 23:28:08.030896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455
pciBusID: 0000:08:00.0
totalMemory: 11.78GiB freeMemory: 11.36GiB
2018-09-06 23:28:08.030988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-06 23:28:08.688955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-06 23:28:08.689049: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-09-06 23:28:08.689071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-09-06 23:28:08.689614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10974 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:08:00.0, compute capability: 7.0)
Epoch 1/100
2018-09-06 23:29:31.853354: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2018-09-06 23:29:31.853522: E tensorflow/stream_executor/cuda/cuda_dnn.cc:360] Possibly insufficient driver version: 390.30.0
Segmentation fault (core dumped)

Judging by this tf.Session() test, TensorFlow itself seems to be fine:

In [2]: import tensorflow as tf
   ...:
   ...: with tf.device('/gpu:0'):
   ...:     a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
   ...:     b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
   ...:     c = tf.matmul(a, b)
   ...: with tf.Session() as sess:
   ...:     print(sess.run(c))
   ...:
2018-09-06 23:37:31.603090: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-09-06 23:37:32.765804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455
pciBusID: 0000:08:00.0
totalMemory: 11.78GiB freeMemory: 11.36GiB
2018-09-06 23:37:32.765866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-06 23:37:33.358184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-06 23:37:33.358276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-09-06 23:37:33.358295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-09-06 23:37:33.358793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10974 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:08:00.0, compute capability: 7.0)
[[22. 28.]
 [49. 64.]]
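
Note that a GPU matmul goes through cuBLAS rather than cuDNN, so the test above does not exercise the library named in the error. A minimal sketch of a check that does run a convolution, assuming the TensorFlow 1.x API used in this post (`run_conv_smoke_test` is a hypothetical helper name, not part of the original script):

```python
def run_conv_smoke_test():
    """Run one small 2-D convolution on the GPU and return the output shape.

    Assumes TensorFlow 1.x (tf.Session, tf.placeholder). The imports live
    inside the function so that defining the helper does not require
    TensorFlow to be installed.
    """
    import numpy as np
    import tensorflow as tf

    with tf.device('/gpu:0'):
        # A placeholder input keeps the graph optimizer from folding the
        # convolution into a constant before it reaches the GPU kernel.
        images = tf.placeholder(tf.float32, shape=[1, 8, 8, 3])
        # Four 3x3 filters over 3 input channels, all ones.
        filters = tf.ones([3, 3, 3, 4])
        conv = tf.nn.conv2d(images, filters,
                            strides=[1, 1, 1, 1], padding='SAME')

    with tf.Session() as sess:
        result = sess.run(
            conv, feed_dict={images: np.ones([1, 8, 8, 3], dtype=np.float32)})
    return result.shape
```

If cuDNN initializes correctly, `run_conv_smoke_test()` returns `(1, 8, 8, 4)`. If it fails with the same CUDNN_STATUS_NOT_INITIALIZED message, the problem can be reproduced without Keras or the training script.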

Thinking the version warnings might be related, I tried downgrading Keras to 2.1.3, but the run still aborts:

Installing collected packages: keras
  Found existing installation: Keras 2.2.2
    Uninstalling Keras-2.2.2:
      Successfully uninstalled Keras-2.2.2
Successfully installed keras-2.1.3
(fastai) root@607b0f29-ad6b-482c-aead-aeae0a84fe2f:~# python fine-tune.py --train_dir /root/data/train_dump/ --val_dir /root/data/val_dump/ --nb_epoch 100 --batch_size 32
Using TensorFlow backend.
Found 2976 images belonging to 4 classes.
Found 1100 images belonging to 4 classes.
2018-09-07 00:30:10.722085: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-09-07 00:30:11.880320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455
pciBusID: 0000:08:00.0
totalMemory: 11.78GiB freeMemory: 11.36GiB
2018-09-07 00:30:11.880388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN V, pci bus id: 0000:08:00.0, compute capability: 7.0)
Epoch 1/100
2018-09-07 00:30:51.858613: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2018-09-07 00:30:51.858965: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.30  Wed Jan 31 22:08:49 PST 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
"""
2018-09-07 00:30:51.859055: E tensorflow/stream_executor/cuda/cuda_dnn.cc:393] possibly insufficient driver version: 390.30.0
2018-09-07 00:30:51.859099: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
Aborted (core dumped)
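
Both runs report the driver as 390.30.0 in the "possibly insufficient driver version" line. A minimal sketch for extracting that number from a log and comparing it with a required minimum follows. The helper names are hypothetical, and the minimum `(384, 81)` is an assumption that should be checked against NVIDIA's CUDA release notes for the CUDA version the installed TensorFlow binary was built with:

```python
import re


def parse_driver_version(log_text):
    """Return the driver version from a TensorFlow cuDNN error log.

    The result is a tuple of ints such as (390, 30, 0), or None if the
    "insufficient driver version" line is not present.
    """
    match = re.search(r"insufficient driver version:\s*(\d+(?:\.\d+)*)",
                      log_text, re.IGNORECASE)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def driver_at_least(found, required):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(found), len(required))
    found = found + (0,) * (width - len(found))
    required = required + (0,) * (width - len(required))
    return found >= required


# Error line copied from the first run in this post.
log_line = ("2018-09-06 23:29:31.853522: E tensorflow/stream_executor/cuda/"
            "cuda_dnn.cc:352] Possibly insufficient driver version: 390.30.0")

# ASSUMPTION: placeholder minimum; look up the real value for your CUDA version.
assumed_minimum = (384, 81)

found_version = parse_driver_version(log_line)
print(found_version, driver_at_least(found_version, assumed_minimum))
```

A result of `True` here only means the driver meets the assumed minimum. It does not rule out other causes of the cuDNN initialization failure.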

0 answers:

No answers yet