Suppose I have two neural networks, represented by the Python classes A and B. The methods A.run() and B.run() each perform feed-forward inference on one image; say A.run() takes 100 ms and B.run() takes 50 ms.
Running them one after the other, i.e.:
import time
import cv2

cap = cv2.VideoCapture(0)  # e.g. a cv2.VideoCapture instance; index 0 is a placeholder

img = cap.read()[1]
start_time = time.time()
A.run(img)  # 100 ms
B.run(img)  # 50 ms
time_diff = time.time() - start_time  # 100 + 50 = 150 ms
The total inference time adds up to 150 ms.
To execute this faster, we can try to parallelize them so that both start at the same time. An implementation using Python threads is outlined below:
import time

class A:
    # This method is spawned in its own thread using Python's threading library
    def run_queue(self, input_queue, output_queue):
        while True:
            img = input_queue.get()
            start_time = time.time()
            output = self.run(img)
            time_diff = time.time() - start_time  # Supposedly 100 ms for class A, and 50 ms for class B
            output_queue.put(output)  # hand the result back to the main thread
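To make the snippet self-contained, this is roughly how the queues and worker threads are wired up (a minimal sketch; the instance names a and b are placeholders for however the two models are actually constructed):

import threading
from queue import Queue

a, b = A(), B()  # however the two models are actually constructed

a_input_queue, a_output_queue = Queue(), Queue()
b_input_queue, b_output_queue = Queue(), Queue()

# One long-lived worker thread per network, each looping inside run_queue()
threading.Thread(target=a.run_queue, args=(a_input_queue, a_output_queue), daemon=True).start()
threading.Thread(target=b.run_queue, args=(b_input_queue, b_output_queue), daemon=True).start()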
# In the main program flow, using the queues and worker threads wired up above:
img = cap.read()[1]
a_input_queue.put(img)
b_input_queue.put(img)
start_time = time.time()
a_output = a_output_queue.get()  # Should take ~100 ms
b_output = b_output_queue.get()  # B.run() should take 50 ms, but since it started at the same
                                 # time as A.run(), this get() should return almost immediately
time_diff = time.time() - start_time  # Should theoretically be ~100 ms
So, in theory, we should only be bottlenecked by A, and the whole system should finish in 100 ms.
However, B.run(), when measured inside B.run_queue(), also seems to take roughly 100 ms. Since both start at about the same time, the whole system still takes roughly 100 ms.
Does this make sense? Is it even worth trying to thread the two neural networks if the total inference time ends up roughly the same anyway (or, if anything, grows even faster)?
My guess is that the GPU is already at 100% utilization for a single network, so when two networks are inferred at the same time it merely interleaves their instructions; the same total amount of computation still has to be executed:
Illustration:
A.run() executes 8 blocks of instructions:
| X | X | X | X | X | X | X | X |
B.run() executes only 4 blocks of instructions:
| Y | Y | Y | Y |
Now, say that the GPU can process 2 blocks of instructions per second.
So, in the case that A.run() and B.run() are run one after the other (non-threaded):
| X | X | X | X | X | X | X | X | Y | Y | Y | Y | -> A.run() takes 4 s, B.run() takes 2 s, everything takes 6 s
In the threaded case, the instructions are rearranged so both start at the same time, but get stretched out:
| X | X | Y | X | X | Y | X | X | Y | X | X | Y | -> A.run() roughly takes 6 s, B.run() roughly takes 6 s, everything seems to take 6 s
Is that what is happening in the example above?
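A minimal timing experiment along those lines (a sketch; it assumes a and b are the model instances from above, img is a frame, and run() blocks until the GPU work has finished — if the framework launches kernels asynchronously, an explicit synchronize would be needed before reading the clock):

import threading
import time

def timed_run(model, img, label):
    start = time.time()
    model.run(img)
    print(f"{label}: {time.time() - start:.3f} s")

# Baseline: each network on its own
timed_run(a, img, "A alone")  # expect ~100 ms
timed_run(b, img, "B alone")  # expect ~50 ms

# Concurrent: if the GPU just interleaves the two instruction streams,
# both measurements should stretch out toward the combined runtime
t_a = threading.Thread(target=timed_run, args=(a, img, "A concurrent"))
t_b = threading.Thread(target=timed_run, args=(b, img, "B concurrent"))
t_a.start(); t_b.start()
t_a.join(); t_b.join()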
Finally, consider a class C that is similar to B (e.g., inference time = 50 ms), except that it runs on the CPU. It therefore shouldn't compete with A for the GPU, yet experimentally it behaves just like B: its inference time also seems to be stretched out to match A's.
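For C, one variable I can think of ruling out is Python's GIL: if C.run() spends its time in pure-Python bytecode (rather than in native code that releases the GIL), the worker threads would serialize on the interpreter lock even with no GPU contention at all. A sketch of moving C into its own process instead (hypothetical wiring; assumes C can be constructed inside the worker process):

import multiprocessing as mp

def c_worker(input_queue, output_queue):
    c = C()  # construct the CPU model inside the worker, avoiding pickling the model
    while True:
        img = input_queue.get()
        output_queue.put(c.run(img))

# (with the spawn start method, e.g. on Windows, guard this with if __name__ == "__main__")
c_input_queue, c_output_queue = mp.Queue(), mp.Queue()
mp.Process(target=c_worker, args=(c_input_queue, c_output_queue), daemon=True).start()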
Any ideas? Thanks in advance.