I am training the following NN layers with Keras / TensorFlow: LSTM(6) + LSTM(42) + LSTM(42) + Dense(2). The input tensor has shape (653015, 240, 6), so it is quite large (~3.5 GB).
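For context, a minimal Keras sketch of a model with these layer sizes could look roughly like this (I am assuming 6 / 42 / 42 / 2 are the unit counts per layer; the loss, optimizer and batch size below are placeholders, not the exact setup):

from keras.models import Sequential
from keras.layers import LSTM, Dense

# Input tensor: (samples=653015, timesteps=240, features=6)
model = Sequential()
model.add(LSTM(6, return_sequences=True, input_shape=(240, 6)))
model.add(LSTM(42, return_sequences=True))
model.add(LSTM(42))                            # last recurrent layer returns only the final step
model.add(Dense(2))
model.compile(loss='mse', optimizer='adam')    # placeholder loss/optimizer
# model.fit(x, y, epochs=1, batch_size=32)     # batch size is a guess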
I tried a cloud GPU instance on paperspace.com with a dedicated NVIDIA Quadro P4000. One epoch took about 9000 seconds. When I tried training on my old AMD Athlon X4 760K at home, one epoch took about 7000 seconds.
When training starts, it prints:
2018-04-05 10:08:22.845846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Quadro P4000
major: 6 minor: 1 memoryClockRate (GHz) 1.48
pciBusID 0000:00:05.0
Total memory: 7.92GiB
Free memory: 7.59GiB
2018-04-05 10:08:22.845885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2018-04-05 10:08:22.845892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2018-04-05 10:08:22.845900: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro P4000, pci bus id: 0000:00:05.0)
So it is definitely using the GPU, not the CPU.
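For completeness, a quick way to double-check device visibility from Python in a TF 1.x setup like the one in the log above (just a sketch, not the only possible check):

import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can see; a '/gpu:0' entry confirms the GPU is visible
print(device_lib.list_local_devices())

# Optionally, log the device each op is actually placed on
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))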
nvidia-smi shows:
Thu Apr 5 11:46:38 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98 Driver Version: 384.98 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P4000 Off | 00000000:00:05.0 On | N/A |
| 46% 41C P0 29W / 105W | 7799MiB / 8114MiB | 21% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2593 G /usr/lib/xorg/Xorg 169MiB |
| 0 2913 G /usr/bin/gnome-shell 106MiB |
| 0 3360 C python3 7511MiB |
+-----------------------------------------------------------------------------+
nvidia-smi dmon:
# gpu pwr temp sm mem enc dec mclk pclk
# Idx W C % % % % MHz MHz
0 32 43 21 1 0 0 3802 1202
0 29 43 19 2 0 0 3802 1202
0 31 43 15 1 0 0 3802 1202
0 32 43 18 1 0 0 3802 1202
0 29 43 20 1 0 0 3802 1202
0 29 43 20 2 0 0 3802 1202
0 29 43 19 1 0 0 3802 1202
When I replace the LSTM layers in the NN with CuDNNLSTM, training on the NVIDIA GPU takes about 3800 seconds per epoch. But that is still only about 2x faster than the old CPU.
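The swap itself is just replacing the layer class (same caveats as the sketch above apply; CuDNNLSTM runs only on the GPU and uses cuDNN's fused recurrent kernel, which is presumably why it is faster):

from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

model = Sequential()
model.add(CuDNNLSTM(6, return_sequences=True, input_shape=(240, 6)))
model.add(CuDNNLSTM(42, return_sequences=True))
model.add(CuDNNLSTM(42))
model.add(Dense(2))
model.compile(loss='mse', optimizer='adam')    # placeholder loss/optimizer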
Can someone explain how to diagnose why the NVIDIA GPU is so slow on this training run? I do not know what else to try.
The author of this article https://medium.com/initialized-capital/benchmarking-tensorflow-performance-and-cost-across-different-gpu-options-69bd85fe5d58 shows that the Paperspace instance is much faster than a Core i7.