Python TensorFlow GPU error when fitting a model on Ubuntu 18.04

Asked: 2018-12-22 03:21:53

Tags: python ubuntu tensorflow deep-learning

I'm receiving the following error when fitting a model: Segmentation fault (core dumped). I'm on Ubuntu 18.04 with an Nvidia RTX 2070 (used for CUDA) and an AMD RX 570 (driving a 4k display). I don't believe the dual GPUs are the problem, since I could run code successfully on the RTX 2070 before installing the AMD card. I went through the tutorial Installing Tensorflow-GPU to set up the system for deep learning. Here is the code I'm trying to run, taken from Install Tensorflow with GPU support:

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Flatten, MaxPooling2D, Conv2D
from keras.callbacks import TensorBoard

# Load MNIST and add a channels dimension for the Conv2D layers
(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(60000, 28, 28, 1).astype('float32')
X_test = X_test.reshape(10000, 28, 28, 1).astype('float32')

# Scale pixel values to [0, 1]
X_train /= 255
X_test /= 255

# One-hot encode the labels
n_classes = 10
y_train = keras.utils.to_categorical(y_train, n_classes)
y_test = keras.utils.to_categorical(y_test, n_classes)

# Simple LeNet-style CNN
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Log training to TensorBoard; the crash happens during fit()
tensor_board = TensorBoard('./logs/LeNet-MNIST-1')

model.fit(X_train, y_train, batch_size=16, epochs=15, verbose=1, validation_data=(X_test, y_test), callbacks=[tensor_board])

Here is the output from running the code above:

Using TensorFlow backend.
Train on 60000 samples, validate on 10000 samples
2018-12-21 21:28:32.425989: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-12-21 21:28:33.111624: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-21 21:28:33.112435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.65
pciBusID: 0000:09:00.0
totalMemory: 7.77GiB freeMemory: 7.65GiB
2018-12-21 21:28:33.112452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-21 21:28:33.380127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-21 21:28:33.380166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-21 21:28:33.380172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-21 21:28:33.380625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7359 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:09:00.0, compute capability: 7.5)
Epoch 1/15
Segmentation fault (core dumped)

Watching the nvidia-smi window, it shows usage for about one second, then drops to zero, and then the terminal gives the segmentation fault. I tried running it in Jupyter as well, and the kernel simply dies. The only thing I can think of is the versions of the programs I have installed. Here are those versions:

GCC:

gcc version 6.5.0 20181026 (Ubuntu 6.5.0-2ubuntu1~18.04) 

CUDA:

CUDA Version 9.0.176
CUDA Patch Version 9.0.176.4

Tensorflow:

1.12.0

CUDNN:

#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"

My nvidia-smi output looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.23       Driver Version: 415.23       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:09:00.0 Off |                  N/A |
|  0%   46C    P0     1W / 175W |      0MiB /  7952MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

According to the blog mentioned above, this code should run easily on an RTX 2070, but it refuses to. Any suggestions?

1 Answer:

Answer 0 (score: 0)

I actually found the answer as a comment at the bottom of the blog (the one with the MNIST example mentioned above). Here is the comment:

Ok, I found out that this was probably related to different versions of cudnn. I created a new environment with conda, specifying python=3.6 (conda create --name tf-gpu python=3.6), and then installed tensorflow-gpu=1.8.0 (conda install tensorflow-gpu=1.8.0). I still wonder why this happens, but at least now all the code on this page runs smoothly.
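
For reference, those commands as a block (the activation step is my addition; conda versions from that era may need source activate instead):

conda create --name tf-gpu python=3.6
conda activate tf-gpu
conda install tensorflow-gpu=1.8.0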

I created a new conda environment with those specific installs, and the code now runs smoothly. I'll leave this posted in case anyone else runs into the problem, since my original issue was not with this MNIST code but with some other code.