Question

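Suppose I have two neural networks, represented as Python classes A and B. The methods A.run() and B.run() each perform a feed-forward inference on a single image.

For example, A.run() takes 100 ms and B.run() takes 50 ms.

Running them one after the other, i.e.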

img = cap.read()[1]  # e.g. cv2.VideoCapture instance
start_time = time.time()
A.run(img)  # 100 ms
B.run(img)  # 50 ms
time_diff = time.time() - start_time  # 100 + 50 = 150 ms

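the inference times add up to about 150 ms.

To make this faster, we can try to parallelize the two networks so that they start at the same time. An implementation using Python threads is outlined below: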

class A:
    # This method runs in its own thread, started with Python's threading library
    def run_queue(self, input_queue, output_queue):
        while True:
            img = input_queue.get()
            start_time = time.time()
            output = self.run(img)
            time_diff = time.time() - start_time  # Supposedly 100 ms for class A, and 50 ms for class B
            output_queue.put(output)  # hand the result back to the main program flow


# In the main program flow:
# Assume that a_input_queue and a_output_queue are tied to an instance of class A
# And similar for class B

img = cap.read()[1]
a_input_queue.put(img)
b_input_queue.put(img)

start_time = time.time()
a_output = a_output_queue.get()  # Should take 100 ms
b_output = b_output_queue.get()  # B.run() should take 50 ms, but since it started at the same time as A.run(), this get() should effectively return immediately
time_diff = time.time() - start_time  # Should theoretically be 100 ms

So, in theory, A should be the only bottleneck, and the whole system should finish in 100 ms.
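
For reference, here is a minimal, self-contained sketch of the queue plumbing described above. It is not the real setup: `SleepModel`, `run_parallel` and the `None` stop sentinel are hypothetical names introduced for illustration, and `run()` only sleeps instead of running a network. Sleeping threads do not compete for a GPU or for the interpreter, so this shows only the "theoretical" overlap, not the behavior measured with real models.

```python
import queue
import threading
import time


class SleepModel:
    """Hypothetical stand-in for A or B: run() just sleeps for `seconds`."""

    def __init__(self, seconds):
        self.seconds = seconds

    def run(self, img):
        time.sleep(self.seconds)  # placeholder for the real inference
        return (self.seconds, img)

    def run_queue(self, input_queue, output_queue):
        while True:
            img = input_queue.get()
            if img is None:  # sentinel (an assumption): stop the worker
                break
            output_queue.put(self.run(img))


def run_parallel(models, img):
    """Send one image to every model's worker thread and time the round."""
    channels = [(queue.Queue(), queue.Queue()) for _ in models]
    workers = [
        threading.Thread(target=m.run_queue, args=ch, daemon=True)
        for m, ch in zip(models, channels)
    ]
    for w in workers:
        w.start()

    start_time = time.perf_counter()
    for in_q, _ in channels:
        in_q.put(img)
    outputs = [out_q.get() for _, out_q in channels]
    elapsed = time.perf_counter() - start_time

    # Shut the workers down cleanly.
    for in_q, _ in channels:
        in_q.put(None)
    for w in workers:
        w.join()
    return outputs, elapsed


outputs, elapsed = run_parallel([SleepModel(0.10), SleepModel(0.05)], "frame")
print(outputs, f"{elapsed * 1000:.0f} ms")
```

With these sleeping stand-ins the elapsed time should land near the slower model's duration, which is the ideal case the reasoning above assumes.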

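However, when B.run() is timed inside B.run_queue(), it also appears to take roughly 100 ms. Since both start at about the same time, the whole system still takes about 100 ms.

Does this make sense? Is it worth threading two neural networks if the total inference time ends up roughly the same (or at best only marginally faster)?

My guess is that a single neural network already drives the GPU to 100% utilization, so when two networks run inference at the same time, the instructions are merely rearranged and the GPU can still only perform the same amount of computation per unit of time: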

Illustration:
A.run() executes 8 blocks of instructions:
| X | X | X | X | X | X | X | X |
B.run() executes only 4 blocks of instructions:
| Y | Y | Y | Y |

Now, say that the GPU can process 2 blocks of instructions per second.

So, in the case that A.run() and B.run() are run one after the other (non-threaded):
| X | X | X | X | X | X | X | X | Y | Y | Y | Y | -> A.run() takes 4 s, B.run() takes 2 s, everything takes 6 s

In the threaded case, the instructions are rearranged so both start at the same time, but get stretched out:
| X | X | Y | X | X | Y | X | X | Y | X | X | Y | -> A.run() roughly takes 6 s, B.run() roughly takes 6 s, everything seems to take 6 s

Is the illustration above what is actually happening?
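
The arithmetic in that illustration can be checked with a small simulation of the same toy model. This is only a sketch of the block picture drawn above, not of how a real GPU schedules work; `finish_times`, `RATE`, and the string encoding of the schedules are names made up for this example. Times are measured from the moment both calls are issued.

```python
def finish_times(schedule, blocks_per_second):
    """Return {label: seconds until that label's last block completes}."""
    done = {}
    for position, label in enumerate(schedule, start=1):
        # Each block occupies one slot; later blocks overwrite the finish time.
        done[label] = position / blocks_per_second
    return done


RATE = 2  # blocks per second, as in the illustration

sequential = "X" * 8 + "Y" * 4   # A.run() first, then B.run()
interleaved = "XXY" * 4          # threaded case: blocks are interleaved

print(finish_times(sequential, RATE))   # {'X': 4.0, 'Y': 6.0}
print(finish_times(interleaved, RATE))  # {'X': 5.5, 'Y': 6.0}
```

In this model the total stays at 6 s either way; interleaving only moves A's finish from 4 s to 5.5 s, which matches the "roughly 6 s" figures in the illustration.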

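Finally, consider a class C that is similar to B (e.g., inference time = 50 ms), except that it runs on the CPU. It should therefore not compete with A for the GPU, yet in my experiments it behaves just like B: its inference time appears to be stretched to match A's.

Any ideas? Thanks in advance.

Does it make sense to multithread two different neural networks on one GPU?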
