Question

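Using Keras with the TensorFlow backend, I am trying to train an LSTM network, and it takes much longer to run on the GPU than on the CPU.

I am training the LSTM network with the fit_generator function. Each epoch takes ~250 seconds on the CPU but ~900 seconds on the GPU. The packages in my GPU environment include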

keras-applications        1.0.8                      py_0    anaconda
keras-base                2.2.4                    py36_0    anaconda
keras-gpu                 2.2.4                         0    anaconda
keras-preprocessing       1.1.0                      py_1    anaconda
...
tensorflow                1.13.1          gpu_py36h3991807_0    anaconda
tensorflow-base           1.13.1          gpu_py36h8d69cac_0    anaconda
tensorflow-estimator      1.13.0                     py_0    anaconda
tensorflow-gpu            1.13.1                   pypi_0    pypi

My CUDA compilation tools are version 9.1.85, and my CUDA and driver versions are

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    On   | 00000000:0A:00.0 Off |                  N/A |
|  0%   39C    P8     5W / 225W |   7740MiB /  7952MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    On   | 00000000:42:00.0 Off |                  N/A |
|  0%   33C    P8    19W / 225W |    142MiB /  7951MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     49251      C   .../whsu014/.conda/envs/whsuphd/bin/python  7729MiB |
|    1      1354      G   /usr/lib/xorg/Xorg                            16MiB |
|    1     49251      C   .../whsu014/.conda/envs/whsuphd/bin/python   113MiB |
+-----------------------------------------------------------------------------+

When I insert this line of code

with tf.Session(config=tf.ConfigProto(log_device_placement=True)):

I see the following in the terminal

...
ining_1/Adam/Const_10: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/Const_11: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720653: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/Const_11: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/add_15/y: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720666: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/add_15/y: (Const)/job:localhost/replica:0/task:0/device:GPU:0
...

So it appears that TensorFlow is using the GPU.

When I profile the code, these are the first 10 lines on the GPU

10852017 function calls (10524203 primitive calls) in 184.768 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    16200  173.827    0.011  173.827    0.011 {built-in method _pywrap_tensorflow_internal.TF_SessionRunCallable}
        6    0.926    0.154    0.926    0.154 {built-in method _pywrap_tensorflow_internal.TF_SessionMakeCallable}
       62    0.813    0.013    0.813    0.013 {built-in method _pywrap_tensorflow_internal.TF_SessionRun_wrapper}
   156954    0.414    0.000    0.415    0.000 {built-in method numpy.array}
    16200    0.379    0.000    1.042    0.000 training.py:643(_standardize_user_data)
    24300    0.338    0.000    0.338    0.000 {method 'partition' of 'numpy.ndarray' objects}
       68    0.301    0.004    0.301    0.004 {built-in method _pywrap_tensorflow_internal.ExtendSession}
    32458    0.223    0.000    2.122    0.000 tensorflow_backend.py:156(get_session)
     3206    0.212    0.000    0.238    0.000 tf_stack.py:31(extract_stack)
    76024    0.210    0.000    0.702    0.000 ops.py:5246(get_controller)
...

On the CPU, these are the first 10 lines

22123473 function calls (21647174 primitive calls) in 60.173 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    16269   42.491    0.003   42.491    0.003 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_Run}
    16269    0.568    0.000   48.964    0.003 session.py:1042(_run)
       56    0.532    0.010    0.532    0.010 {built-in method time.sleep}
   153641    0.458    0.000    0.460    0.000 {built-in method numpy.core.multiarray.array}
183148/125354    0.447    0.000    1.316    0.000 python_message.py:469(init)
  1226659    0.362    0.000    0.364    0.000 {built-in method builtins.getattr}
2302110/2301986    0.339    0.000    0.358    0.000 {built-in method builtins.isinstance}
        8    0.285    0.036    0.285    0.036 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_ExtendGraph}
    12150    0.267    0.000    0.271    0.000 callbacks.py:211(on_batch_end)
147026/49078    0.264    0.000    1.429    0.000 python_message.py:1008(ByteSize)
...

This is my code.

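import numpy as np
import tensorflow as tf
from keras.callbacks import ModelCheckpoint
from keras.layers import LSTM, Dense
from keras.models import Sequential
from matplotlib import pyplot

def train_generator(x_list, y_list):
    # 0.1 validation split: the first 90% of the samples are used for training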
    train_length = (len(x_list)//10)*9
    while True:
        for i in range(train_length):
            train_x = np.array([x_list[i]])
            train_y = np.array([y_list[i]])
            yield train_x, train_y

def val_generator(x_list, y_list):
    # 0.1 validation split
    val_length = len(x_list)//10
    while True:
        for i in range(-val_length, 0, 1):
            val_x = np.array([x_list[i]])
            val_y = np.array([y_list[i]])
            yield val_x, val_y



with tf.Session(config=tf.ConfigProto(log_device_placement=True)):
    model = Sequential()
    model.add(LSTM(64, return_sequences=False,
                   input_shape=(None, 24)))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')
    checkpointer = ModelCheckpoint(filepath="weights.hdf5",
                                   monitor='val_loss', verbose=1,
                                   save_best_only=True)

    history = model.fit_generator(generator=train_generator(train_x,
                                                            train_y),
                                  steps_per_epoch=(len(train_x)//10)*9,
                                  epochs=5,
                                  validation_data=val_generator(train_x,
                                                                train_y),
                                  validation_steps=len(train_x)//10,
                                  callbacks=[checkpointer],
                                  verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.legend()
pyplot.show()

I expected training on the GPU to be much faster. How can I fix this? Can someone help me understand what is causing the slowdown? Thank you.

Answer 1

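A couple of observations:

1. Use CuDNNLSTM instead of LSTM to train on the GPU; you will see a considerable increase in speed.
2. Sometimes, for very small networks, the overhead of transferring data between the CPU and the GPU outweighs the benefit of the parallel computation done on the GPU; in other words, more time is lost transferring the data than is gained by training on the GPU.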
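The first observation could be applied to the model in the question roughly as follows. This is a minimal sketch, assuming the Keras 2.2.4 / tensorflow-gpu 1.13 environment listed in the question; the function name build_cudnn_model is hypothetical and only wraps the swap of LSTM for CuDNNLSTM, with every other setting copied from the question.

```python
def build_cudnn_model():
    """Same architecture as the question's model, with CuDNNLSTM replacing LSTM."""
    # Imported inside the function so that defining it does not itself
    # require Keras to be installed (assumption: Keras 2.2.4, TF backend).
    from keras.layers import CuDNNLSTM, Dense
    from keras.models import Sequential

    model = Sequential()
    # CuDNNLSTM only runs on a GPU with the TensorFlow backend.
    model.add(CuDNNLSTM(64, return_sequences=False,
                        input_shape=(None, 24)))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')
    return model
```

Calling model = build_cudnn_model() in the GPU environment would then replace the four model-building lines in the question. Note that CuDNNLSTM does not take the activation arguments that LSTM accepts, so the two layers are not guaranteed to give identical results.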

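GPUs should be used for highly intensive tasks and computations (very large LSTMs or heavy CNNs). For a small MLP, or even a small LSTM, you may find that the network trains faster on the CPU than on the GPU.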
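The overhead argument can be checked against the two profiles in the question. The short calculation below uses only the figures from those listings (total time and call count of the dominant session-run entry) to get the average time per call; the variable names are mine. Since the generators yield one sample per step, each call processes a single sequence, so a higher per-call time on the GPU is consistent with, though not proof of, fixed per-step overhead dominating the work.

```python
# Figures copied from the two cProfile listings in the question.
gpu_total_s, gpu_calls = 173.827, 16200   # TF_SessionRunCallable (GPU run)
cpu_total_s, cpu_calls = 42.491, 16269    # TF_Run (CPU run)

# Average wall time per session call, in milliseconds.
gpu_ms_per_call = 1000 * gpu_total_s / gpu_calls
cpu_ms_per_call = 1000 * cpu_total_s / cpu_calls

print(f"GPU: {gpu_ms_per_call:.2f} ms per session call")
print(f"CPU: {cpu_ms_per_call:.2f} ms per session call")
print(f"GPU/CPU ratio: {gpu_ms_per_call / cpu_ms_per_call:.1f}x")
```

With these numbers, each GPU call takes roughly four times as long as each CPU call, even though both runs make about the same number of calls.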
