I am training the following NN layers with Keras / TensorFlow: LSTM(6) + LSTM(42) + LSTM(42) + Dense(2). The input tensor has shape (653015, 240, 6), so it is quite large (~3.5 GB).
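For context, a minimal Keras sketch of a model with these layer sizes could look roughly like this (I am assuming 6 / 42 / 42 / 2 are the unit counts per layer; the loss, optimizer and batch size below are placeholders, not the exact setup):

from keras.models import Sequential
from keras.layers import LSTM, Dense

# Input tensor: (samples=653015, timesteps=240, features=6)
model = Sequential()
model.add(LSTM(6, return_sequences=True, input_shape=(240, 6)))
model.add(LSTM(42, return_sequences=True))
model.add(LSTM(42))                            # last recurrent layer returns only the final step
model.add(Dense(2))
model.compile(loss='mse', optimizer='adam')    # placeholder loss/optimizer
# model.fit(x, y, epochs=1, batch_size=32)     # batch size is a guess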
I tried a cloud GPU instance on paperspace.com with a dedicated NVIDIA Quadro P4000. One epoch took about 9000 seconds. When I tried training on my old AMD Athlon X4 760K at home, one epoch took about 7000 seconds.
When training starts, it prints:
2018-04-05 10:08:22.845846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Quadro P4000
major: 6 minor: 1 memoryClockRate (GHz) 1.48
pciBusID 0000:00:05.0
Total memory: 7.92GiB
Free memory: 7.59GiB
2018-04-05 10:08:22.845885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2018-04-05 10:08:22.845892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2018-04-05 10:08:22.845900: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro P4000, pci bus id: 0000:00:05.0)
So it is definitely using the GPU, not the CPU.
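For completeness, a quick way to double-check device visibility from Python in a TF 1.x setup like the one in the log above (just a sketch, not the only possible check):

import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can see; a '/gpu:0' entry confirms the GPU is visible
print(device_lib.list_local_devices())

# Optionally, log the device each op is actually placed on
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))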
nvidia-smi shows:
Thu Apr 5 11:46:38 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98 Driver Version: 384.98 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P4000 Off | 00000000:00:05.0 On | N/A |
| 46% 41C P0 29W / 105W | 7799MiB / 8114MiB | 21% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2593 G /usr/lib/xorg/Xorg 169MiB |
| 0 2913 G /usr/bin/gnome-shell 106MiB |
| 0 3360 C python3 7511MiB |
+-----------------------------------------------------------------------------+
nvidia-smi dmon:
# gpu pwr temp sm mem enc dec mclk pclk
# Idx W C % % % % MHz MHz
0 32 43 21 1 0 0 3802 1202
0 29 43 19 2 0 0 3802 1202
0 31 43 15 1 0 0 3802 1202
0 32 43 18 1 0 0 3802 1202
0 29 43 20 1 0 0 3802 1202
0 29 43 20 2 0 0 3802 1202
0 29 43 19 1 0 0 3802 1202
When I replace the LSTM layers in the NN with CuDNNLSTM, training on the NVIDIA GPU takes about 3800 seconds per epoch. But that is still only about 2x faster than the old CPU.
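The swap itself is just replacing the layer class (same caveats as the sketch above apply; CuDNNLSTM runs only on the GPU and uses cuDNN's fused recurrent kernel, which is presumably why it is faster):

from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

model = Sequential()
model.add(CuDNNLSTM(6, return_sequences=True, input_shape=(240, 6)))
model.add(CuDNNLSTM(42, return_sequences=True))
model.add(CuDNNLSTM(42))
model.add(Dense(2))
model.compile(loss='mse', optimizer='adam')    # placeholder loss/optimizer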
Can someone explain how to diagnose why the NVIDIA GPU is so slow on this training run? I do not know what else to try.
The author of this article https://medium.com/initialized-capital/benchmarking-tensorflow-performance-and-cost-across-different-gpu-options-69bd85fe5d58 shows that the Paperspace instance is much faster than a Core i7.