TensorFlow with 4 GPUs does not speed up training

Time: 2019-01-23 01:26:38

Tags: tensorflow gpu

My code first:

from sklearn.datasets.samples_generator import make_blobs
from matplotlib import pyplot
from numpy import where
from keras.utils import to_categorical

from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from keras.utils import multi_gpu_model

# generate a 3-class, 3-feature blob dataset of one million samples
X, y = make_blobs(n_samples=1000000, centers=3, n_features=3, cluster_std=2, random_state=2)

# one-hot encode the labels
y = to_categorical(y)

# 50/50 train/test split
n_train = 500000
trainX, testX = X[:n_train, :], X[n_train:, :]
trainY, testY = y[:n_train], y[n_train:]

# small MLP, replicated across 4 GPUs for data-parallel training
model = Sequential()
model.add(Dense(50, input_dim=3, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(3, activation='softmax'))
p_model = multi_gpu_model(model, gpus=4)

opt = SGD(lr=0.01, momentum=0.9)
p_model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

history = p_model.fit(trainX, trainY, validation_data=(testX, testY), epochs=20, verbose=1, batch_size=32)
_, train_acc = p_model.evaluate(trainX, trainY, verbose=0)
_, test_acc = p_model.evaluate(testX, testY, verbose=0)

print("Train: %.3f, Test: %.3f" % (train_acc, test_acc))

Here is how the 4 powerful GPUs are used:

p_model = multi_gpu_model(model, gpus=4)

While training is running, I can see the following (so the GPUs are fully utilized?):

(tf_gpu) [martin@A08-R32-I196-3-FZ2LTP2 mlm]$ nvidia-smi 
Wed Jan 23 09:08:24 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:02:00.0 Off |                    0 |
| N/A   29C    P0    49W / 250W |  21817MiB / 22919MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:04:00.0 Off |                    0 |
| N/A   34C    P0    50W / 250W |  21817MiB / 22919MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:83:00.0 Off |                    0 |
| N/A   28C    P0    48W / 250W |  21817MiB / 22919MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   36C    P0    51W / 250W |  21817MiB / 22919MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    114918      C   python                                     21807MiB |
|    1    114918      C   python                                     21807MiB |
|    2    114918      C   python                                     21807MiB |
|    3    114918      C   python                                     21807MiB |
+-----------------------------------------------------------------------------+

However, it is not faster at all compared with a single-GPU run, or even with a run on my Mac desktop. The whole training takes about 20 minutes, almost the same time as training with a single GPU, and it is much slower than the training on my personal Mac. Why is that?

1 Answer:

Answer 0 (score: 0):

Increase batch_size beyond 32. Keep increasing it until the GPUs are fully utilized. Yes, this may affect your model, but it does significantly improve performance. You will have to find a sweet spot for that batch_size. Note that multi_gpu_model splits each batch across the GPUs, so with batch_size=32 each of your 4 GPUs only gets 8 samples per step, and the per-step communication overhead dwarfs the tiny amount of compute; that is why nvidia-smi shows only 2-9% GPU utilization.
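
A minimal sketch of the suggested change, applied to the fit call from the question. The values below are illustrative starting points, not measured optima, and per_gpu_batch / n_gpus are names introduced here for clarity:

# hypothetical starting point; keep raising it until nvidia-smi shows high GPU-Util
per_gpu_batch = 256
n_gpus = 4

# multi_gpu_model splits each batch into n_gpus sub-batches, so the total
# batch size must be scaled by the number of GPUs to keep each one busy
history = p_model.fit(trainX, trainY, validation_data=(testX, testY),
                      epochs=20, verbose=1,
                      batch_size=per_gpu_batch * n_gpus)

Larger batches also change the optimization dynamics; a common rule of thumb is to scale the learning rate roughly in proportion to the batch size, so the SGD(lr=0.01, momentum=0.9) optimizer may need retuning as well.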