Should Keras with the Theano backend be 18x slower than with the TensorFlow backend?

Time: 2017-02-10 15:13:24

Tags: tensorflow theano keras

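I just installed keras, tensorflow and theano on my machine and ran a quick comparison of keras with tensorflow as the backend against keras with theano as the backend. The results were more extreme than I expected.

I am using the following versions of the three packages: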

>>> theano.__version__
'0.8.2'
>>> tensorflow.__version__
'0.12.1'
>>> keras.__version__
'1.2.1'

To compare the two backends I used cifar10_cnn.py. With tensorflow as my backend, I got this result:

deep@deep-Precision-7710:~/Downloads/keras/examples$ python cifar10_cnn.py 
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
X_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
Using real-time data augmentation.
Epoch 1/200
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: Quadro M5000M
major: 5 minor: 2 memoryClockRate (GHz) 1.0505
pciBusID 0000:01:00.0
Total memory: 7.93GiB
Free memory: 7.59GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro M5000M, pci bus id: 0000:01:00.0)
50000/50000 [==============================] - 16s - loss: 1.7916 - acc: 0.3350 - val_loss: 1.4998 - val_acc: 0.4490
Epoch 2/200
50000/50000 [==============================] - 15s - loss: 1.4020 - acc: 0.4907 - val_loss: 1.2039 - val_acc: 0.5779
Epoch 3/200
50000/50000 [==============================] - 15s - loss: 1.2460 - acc: 0.5531 - val_loss: 1.0272 - val_acc: 0.6311

The whole run takes 51 minutes.

With Theano as my backend, I got this result:

deep@deep-Precision-7710:~/Downloads/keras/examples$ python cifar10_cnn.py 
Using Theano backend.
X_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
Using real-time data augmentation.
Epoch 1/200
50000/50000 [==============================] - 292s - loss: 1.8008 - acc: 0.3286 - val_loss: 1.4991 - val_acc: 0.4613
Epoch 2/200
50000/50000 [==============================] - 285s - loss: 1.4302 - acc: 0.4774 - val_loss: 1.1840 - val_acc: 0.5737
Epoch 3/200
50000/50000 [==============================] - 288s - loss: 1.2690 - acc: 0.5452 - val_loss: 1.0930 - val_acc: 0.6030

This takes about 15.75 hours to run!?
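As a sanity check on those totals, here is a minimal sketch that recomputes the slowdown from the per-epoch times printed in the two logs. It only uses the three epochs shown for each backend, so the projected totals are rough estimates and will not match the measured run times exactly.

```python
# Per-epoch wall-clock times in seconds, copied from the two logs above.
tf_epoch_seconds = [16, 15, 15]
theano_epoch_seconds = [292, 285, 288]
nb_epoch = 200  # same value as in cifar10_cnn.py


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))


tf_mean = mean(tf_epoch_seconds)
theano_mean = mean(theano_epoch_seconds)

# Ratio of average epoch times: how many times slower the Theano run is.
slowdown = theano_mean / tf_mean

# Projected totals if every epoch took the average of the first three.
tf_total_minutes = tf_mean * nb_epoch / 60.0
theano_total_hours = theano_mean * nb_epoch / 3600.0

print('average epoch: tensorflow %.1fs, theano %.1fs' % (tf_mean, theano_mean))
print('slowdown: %.1fx' % slowdown)
print('projected total: tensorflow %.0f min, theano %.1f h'
      % (tf_total_minutes, theano_total_hours))
```

The ratio comes out a little under 19x, which is in line with the 18x to 19x figure in the question.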

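Should I be surprised that the keras theano backend is 18x to 19x slower than the keras tensorflow backend? (I have included the version of cifar10_cnn.py that I am using.) To switch from one backend to the other, I only changed the backend setting in keras.json. I did not adjust image_dim_ordering, since that seems to be dictated by the dataset.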
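For reference, a minimal sketch for printing the two settings mentioned above before each run. It assumes the config file lives at the usual location, ~/.keras/keras.json; the helper name read_keras_config is hypothetical and not part of Keras.

```python
import json
import os


def read_keras_config(path=None):
    """Return the parsed Keras config as a dict, or {} if the file is missing.

    The default path (~/.keras/keras.json) is an assumption about where
    the config file lives on this machine.
    """
    if path is None:
        path = os.path.join(os.path.expanduser('~'), '.keras', 'keras.json')
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


config = read_keras_config()
print('backend:', config.get('backend'))
print('image_dim_ordering:', config.get('image_dim_ordering'))
```

Both logs print X_train shape (50000, 32, 32, 3), with the channel axis last, which suggests the same image_dim_ordering was in effect for the two runs.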

'''Train a simple deep CNN on the CIFAR10 small images dataset.

GPU run command with Theano backend (with TensorFlow, the GPU is automatically used):
    THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python cifar10_cnn.py

It gets down to 0.65 test logloss in 25 epochs, and down to 0.55 after 50 epochs.
(it's still underfitting at that point, though).
'''

from __future__ import print_function
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import SGD
from keras.utils import np_utils

import datetime

batch_size = 32
nb_classes = 10
nb_epoch = 200
data_augmentation = True

# input image dimensions
img_rows, img_cols = 32, 32
# The CIFAR10 images are RGB.
img_channels = 3

# The data, shuffled and split between train and test sets:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices.
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

model = Sequential()

model.add(Convolution2D(32, 3, 3, border_mode='same',
                        input_shape=X_train.shape[1:]))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Convolution2D(64, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# Let's train the model using SGD + momentum (how original).
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

if not data_augmentation:
    print('Not using data augmentation.')
    model.fit(X_train, Y_train,
              batch_size=batch_size,
              nb_epoch=nb_epoch,
              validation_data=(X_test, Y_test),
              shuffle=True)
else:
    print('Using real-time data augmentation.')
    # This will do preprocessing and realtime data augmentation:
    datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=0,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images

    # Compute quantities required for featurewise normalization
    # (std, mean, and principal components if ZCA whitening is applied).
    datagen.fit(X_train)

    start = datetime.datetime.now()
    # Fit the model on the batches generated by datagen.flow().
    model.fit_generator(datagen.flow(X_train, Y_train,
                                     batch_size=batch_size),
                        samples_per_epoch=X_train.shape[0],
                        nb_epoch=nb_epoch,
                        validation_data=(X_test, Y_test))
    stop = datetime.datetime.now()
    print("\nTime to run:", stop - start)
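One thing worth checking, offered as a possibility rather than a confirmed cause: the script's docstring gives a Theano GPU command that sets THEANO_FLAGS, while the Theano run above was started with a plain python cifar10_cnn.py and its log shows no GPU device line. The sketch below inspects the THEANO_FLAGS environment variable only. Theano can also be configured through a .theanorc file, which this does not read, and parse_theano_flags is a hypothetical helper, not a Theano function.

```python
import os


def parse_theano_flags(flags):
    """Split a THEANO_FLAGS-style string ('key=value,key=value') into a dict."""
    result = {}
    for item in flags.split(','):
        item = item.strip()
        if not item or '=' not in item:
            continue  # skip empty or malformed entries
        key, value = item.split('=', 1)
        result[key.strip()] = value.strip()
    return result


flags = parse_theano_flags(os.environ.get('THEANO_FLAGS', ''))
device = flags.get('device')

if device is None:
    # Theano's default device is the CPU when nothing else configures it.
    print('THEANO_FLAGS does not set device; check .theanorc as well')
elif device.startswith('gpu') or device.startswith('cuda'):
    print('THEANO_FLAGS requests a GPU device:', device)
else:
    print('THEANO_FLAGS requests a non-GPU device:', device)
```

With Theano installed, printing theano.config.device and theano.config.floatX from inside the script shows the values actually in effect, whichever source they came from.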

0 answers:

No answers yet