Trouble training a model in TensorFlow

Time: 2020-04-17 17:58:03

Tags: python tensorflow nvidia cudnn

I am trying to train a CNN in Python using TensorFlow's GPU support. Everything seems to work until I call model.fit, at which point the following error appears:

Num GPUs Available:  1
WARNING:tensorflow:From borrar.py:48: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
Train for 390 steps
2020-04-17 19:29:39.071613: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-17 19:29:39.072026: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.072079: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.072117: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.072872: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.072915: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.072965: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.073001: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.073029: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.073081: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.097419: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-17 19:29:39.708887: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-17 19:29:39.712852: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-17 19:29:39.712900: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

I have considered a version incompatibility, but everything appears to match. Here is some information about my NVIDIA driver and libraries:

OS: Linux Mint 19.3

Output of the nvidia-smi command:

Fri Apr 17 19:46:18 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   42C    P8     7W /  N/A |    455MiB /  5944MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1287      G   /usr/lib/xorg/Xorg                           238MiB |
|    0      1977      G   cinnamon                                     103MiB |
|    0      2550      G   ...AAAAAAAAAAAACAAAAAAAAAA= --shared-files   111MiB |
+-----------------------------------------------------------------------------+

CUDA version (cat /usr/local/cuda/version.txt):

CUDA Version 10.1.243

cuDNN version (cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2):

#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"
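For reference, the three defines above combine into a single integer through the CUDNN_VERSION macro printed in the same grep output. The snippet below is only a small sketch of that arithmetic: it parses the header lines exactly as pasted above and applies the macro's formula. The helper names parse_cudnn_version and cudnn_version_number are hypothetical and are not part of cuDNN or TensorFlow.

```python
import re

# Header excerpt copied from the grep output above (not read from disk).
HEADER_EXCERPT = """\
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
"""


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) parsed from cudnn.h-style #define lines.

    Hypothetical helper for illustration only.
    """
    fields = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(r"#define\s+CUDNN_%s\s+(\d+)" % name, header_text)
        if match is None:
            raise ValueError("CUDNN_%s not found in header text" % name)
        fields[name] = int(match.group(1))
    return fields["MAJOR"], fields["MINOR"], fields["PATCHLEVEL"]


def cudnn_version_number(major, minor, patch):
    """Apply the same formula as the CUDNN_VERSION macro in the header."""
    return major * 1000 + minor * 100 + patch


major, minor, patch = parse_cudnn_version(HEADER_EXCERPT)
print("cuDNN %d.%d.%d -> CUDNN_VERSION %d"
      % (major, minor, patch, cudnn_version_number(major, minor, patch)))
```

With the values shown above this yields cuDNN 7.6.5, which agrees with the libcudnn.so.7 library that the log reports as loaded.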

I am also attaching the full output of my script, in case it provides more information:

python3 borrar.py 
2020-04-17 19:29:36.584212: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-04-17 19:29:36.585508: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
                                               image  labels
0  /home/inaki/Desktop/dogs-vs-cats/train/cat.640...       0
1  /home/inaki/Desktop/dogs-vs-cats/train/dog.427...       1
2  /home/inaki/Desktop/dogs-vs-cats/train/cat.588...       0
3  /home/inaki/Desktop/dogs-vs-cats/train/dog.712...       1
4  /home/inaki/Desktop/dogs-vs-cats/train/dog.180...       1



2020-04-17 19:29:37.620115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-17 19:29:37.644433: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 19:29:37.644678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 268.26GiB/s
2020-04-17 19:29:37.644954: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-17 19:29:37.644989: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-17 19:29:37.646131: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-17 19:29:37.646419: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-17 19:29:37.647520: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-17 19:29:37.648231: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-17 19:29:37.648261: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-17 19:29:37.648326: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 19:29:37.648565: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 19:29:37.648757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-17 19:29:37.648980: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-17 19:29:37.653650: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2592000000 Hz
2020-04-17 19:29:37.654417: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56189f4f1640 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-17 19:29:37.654443: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-17 19:29:37.722073: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 19:29:37.722352: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56189df9abe0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-17 19:29:37.722366: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1660 Ti, Compute Capability 7.5
2020-04-17 19:29:37.722606: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 19:29:37.722833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 268.26GiB/s
2020-04-17 19:29:37.722904: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-17 19:29:37.722949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-17 19:29:37.722978: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-17 19:29:37.723025: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-17 19:29:37.723039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-17 19:29:37.723053: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-17 19:29:37.723063: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-17 19:29:37.723105: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 19:29:37.723309: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 19:29:37.723483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-17 19:29:37.723530: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-17 19:29:37.724399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-17 19:29:37.724408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-04-17 19:29:37.724412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-04-17 19:29:37.724475: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 19:29:37.724686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 19:29:37.724880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5107 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
ImageInput (InputLayer)      [(None, 250, 250, 3)]     0         
_________________________________________________________________
Conv1_1 (Conv2D)             (None, 250, 250, 64)      1792      
_________________________________________________________________
Conv1_2 (Conv2D)             (None, 250, 250, 64)      36928     
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 125, 125, 64)      0         
_________________________________________________________________
Conv2_1 (Conv2D)             (None, 125, 125, 128)     73856     
_________________________________________________________________
Conv2_2 (Conv2D)             (None, 125, 125, 128)     147584    
_________________________________________________________________
pool2 (MaxPooling2D)         (None, 62, 62, 128)       0         
_________________________________________________________________
Conv3_1 (Conv2D)             (None, 62, 62, 256)       295168    
_________________________________________________________________
bn1 (BatchNormalization)     (None, 62, 62, 256)       1024      
_________________________________________________________________
Conv3_2 (Conv2D)             (None, 62, 62, 256)       590080    
_________________________________________________________________
bn2 (BatchNormalization)     (None, 62, 62, 256)       1024      
_________________________________________________________________
Conv3_3 (Conv2D)             (None, 62, 62, 256)       590080    
_________________________________________________________________
pool3 (MaxPooling2D)         (None, 31, 31, 256)       0         
_________________________________________________________________
Conv4_1 (Conv2D)             (None, 31, 31, 512)       1180160   
_________________________________________________________________
bn3 (BatchNormalization)     (None, 31, 31, 512)       2048      
_________________________________________________________________
Conv4_2 (Conv2D)             (None, 31, 31, 512)       2359808   
_________________________________________________________________
bn4 (BatchNormalization)     (None, 31, 31, 512)       2048      
_________________________________________________________________
Conv4_3 (Conv2D)             (None, 31, 31, 512)       2359808   
_________________________________________________________________
pool4 (MaxPooling2D)         (None, 15, 15, 512)       0         
_________________________________________________________________
flatten (Flatten)            (None, 115200)            0         
_________________________________________________________________
fc1 (Dense)                  (None, 1024)              117965824 
_________________________________________________________________
fc2 (Dense)                  (None, 512)               524800    
_________________________________________________________________
fc3 (Dense)                  (None, 2)                 1026      
=================================================================
Total params: 126,133,058
Trainable params: 126,129,986
Non-trainable params: 3,072
_________________________________________________________________
Num GPUs Available:  1
WARNING:tensorflow:From borrar.py:48: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
Train for 390 steps
2020-04-17 19:29:39.071613: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-17 19:29:39.072026: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.072079: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.072117: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.072872: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.072915: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.072965: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.073001: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.073029: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.073081: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-04-17 19:29:39.097419: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-17 19:29:39.708887: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-17 19:29:39.712852: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-17 19:29:39.712900: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node model/Conv1_1/Conv2D}}]]
  1/390 [..............................] - ETA: 8:09Traceback (most recent call last):
  File "borrar.py", line 252, in <module>
    main()
  File "borrar.py", line 48, in main
    history = model.fit_generator(generator=train_data_gen, epochs=n_epochs, steps_per_epoch=n_steps_per_epoch)
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 1306, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 632, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/home/inaki/anaconda3/envs/tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node model/Conv1_1/Conv2D (defined at borrar.py:48) ]] [Op:__inference_distributed_function_2653]

Function call stack:
distributed_function

2020-04-17 19:29:39.798788: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

If anyone has any questions, feel free to ask.

Thanks.

0 answers:

No answers