Question

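I am trying to build an inference server for a FaceNet model with a Flask API, for an image-matching task. I scale the server with Gunicorn. The server receives an image from the client as a string sequence in a POST request, matches it against the images in a MongoDB database, and computes the distance.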

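When I run the app under Gunicorn, the server calls TensorFlow to load the model and the Gunicorn worker instances are created; I can see them with nvidia-smi pmon. But when I send requests to the server from a client, only GPU 0 is used, and even GPU 0 is not utilized as heavily as when I run the same work on it without the server/client setup. My gunicorn call uses the gevent worker class and looks like this: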

gunicorn --bind 0.0.0.0:5000 --timeout 1000000 -w 4 -k gevent wsgi:app

I have 4 GPUs. With the server running under the command above, the output of nvidia-smi pmon is:

0      93715     C     0     0     0     0   python         
0      93716     C     0     0     0     0   python         
0      93717     C     0     0     0     0   python         
0      93719     C     3     0     0     0   python         
1      93715     C     0     0     0     0   python         
1      93716     C     0     0     0     0   python         
1      93717     C     0     0     0     0   python         
1      93719     C     0     0     0     0   python         
2      93715     C     0     0     0     0   python         
2      93716     C     0     0     0     0   python         
2      93717     C     0     0     0     0   python         
2      93719     C     0     0     0     0   python         
3      93715     C     0     0     0     0   python         
3      93716     C     0     0     0     0   python         
3      93717     C     0     0     0     0   python         
3      93719     C     0     0     0     0   python         
0      93715     C     0     0     0     0   python         
0      93716     C     0     0     0     0   python         
# gpu    pid  type    sm   mem   enc   dec   command
# Idx      #   C/G     %     %     %     %   name
0      93717     C     0     0     0     0   python         
0      93719     C     2     0     0     0   python         
1      93715     C     0     0     0     0   python         
1      93716     C     0     0     0     0   python         
1      93717     C     0     0     0     0   python         
1      93719     C     0     0     0     0   python         
2      93715     C     0     0     0     0   python         
2      93716     C     0     0     0     0   python         
2      93717     C     0     0     0     0   python         
2      93719     C     0     0     0     0   python         
3      93715     C     0     0     0     0   python         
3      93716     C     0     0     0     0   python         
3      93717     C     0     0     0     0   python         
3      93719     C     0     0     0     0   python         
0      93715     C     0     0     0     0   python         
0      93716     C     0     0     0     0   python         
0      93717     C     0     0     0     0   python         
0      93719     C     3     0     0     0   python

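As the output shows, only GPU 0 receives all the calls, and its utilization is only 3-5%. My test code without the server-client model reaches 25% utilization on each GPU directly. Can someone explain what I am doing wrong, or suggest something else to try?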

Answer 1

The reason only the first GPU is used is that, by default, TensorFlow uses only the first GPU even when it can see all 4.

Although Gunicorn dispatches multiple calls and each call invokes its own TensorFlow, they all see the 4 GPUs and all use the first one.

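One possible solution, I think, is to run 4 separate Flask or Gunicorn instances, each configured with the environment variable CUDA_VISIBLE_DEVICES set to 0, 1, 2, and 3 respectively. You would then use Nginx to forward the API calls to these 4 servers.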
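The idea above could be sketched as a small Python launcher that starts one Gunicorn instance per GPU, each with its own CUDA_VISIBLE_DEVICES value and its own port. This is only a sketch under assumptions: the helper names `build_launch_plan` and `launch`, the ports 5001-5004, and the choice of one worker per instance are illustrative and not from the question; `wsgi:app`, the gevent worker class, and the timeout are copied from the question's command.

```python
import os
import subprocess

NUM_GPUS = 4      # from the question: the machine has 4 GPUs
BASE_PORT = 5001  # assumption: ports 5001-5004 are free on this host


def build_launch_plan(num_gpus=NUM_GPUS, base_port=BASE_PORT):
    """Return one (env_overrides, command) pair per GPU."""
    plan = []
    for gpu in range(num_gpus):
        # Each instance sees exactly one GPU, which it addresses as device 0.
        env_overrides = {"CUDA_VISIBLE_DEVICES": str(gpu)}
        command = [
            "gunicorn",
            "--bind", f"0.0.0.0:{base_port + gpu}",
            "--timeout", "1000000",
            "-w", "1",  # assumption: one worker per GPU-bound instance
            "-k", "gevent",
            "wsgi:app",
        ]
        plan.append((env_overrides, command))
    return plan


def launch(plan):
    """Start every instance with its environment overrides applied."""
    processes = []
    for env_overrides, command in plan:
        # The variable is set before the process starts, so it is already
        # in place when TensorFlow initializes CUDA inside the worker.
        env = dict(os.environ, **env_overrides)
        processes.append(subprocess.Popen(command, env=env))
    return processes
```

Calling `launch(build_launch_plan())` would start the four instances; Nginx (or any other reverse proxy) would then be pointed at ports 5001-5004 to spread incoming requests across them.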

I am not sure why the GPU memory usage is low.

Problem running an image-matching service with FaceNet on GPU using Gunicorn and Flask
