Keras模型始终仅使用一个GPU

时间:2019-05-05 12:45:12

标签: python tensorflow amazon-ec2 keras

我正在尝试在具有8个GPU的AWS EC2 p3.16xlarge实例上训练CNN模型。当我使用500的批处理大小时,即使系统有8个GPU,也一直都只使用一个GPU。当我将批处理大小增加到1000时,它仅使用GPU,与500情况相比,它的确变慢了。如果我将批处理大小增加到2000,则会发生内存溢出。如何解决此问题?

我正在使用tensorflow后端。 GPU利用率如下,

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   47C    P0    69W / 300W |  15646MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   44C    P0    59W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   45C    P0    61W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1A.0 Off |                    0 |
| N/A   47C    P0    64W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   48C    P0    62W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   46C    P0    61W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   46C    P0    65W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   46C    P0    63W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     15745      C   python3                                    15635MiB |
|    1     15745      C   python3                                      491MiB |
|    2     15745      C   python3                                      491MiB |
|    3     15745      C   python3                                      491MiB |
|    4     15745      C   python3                                      491MiB |
|    5     15745      C   python3                                      491MiB |
|    6     15745      C   python3                                      491MiB |
|    7     15745      C   python3                                      491MiB |
+-----------------------------------------------------------------------------+

1 个答案:

答案 0 :(得分:1)

您可能正在寻找multiple_gpu_model。您可以在keras documentation中看到它。

您可以只做模型并进行parallel_model = multi_gpu_model(model, gpus=n_gpus)

下次不要忘记添加minimal working exemple