Parallel processing on GPU (MXNet) and CPU with Python

Date: 2020-05-14 08:53:50

Tags: python-3.x multithreading parallel-processing gpu mxnet

Hello, I have a data processing pipeline that I would like to optimize by running some processing threads on the CPU while MXNet prediction models run on the GPU (Python 3.6).

The idea I want to apply is the following (assuming my machine has N GPUs):

  • The GPU job dispatcher reads a sequence of N frames from the video and sends each frame to one GPU.
  • Each GPU processes its frame and predicts its content using MXNet.
  • Once all N GPUs have finished their predictions, I want to do the following simultaneously:
    1. Send the prediction outputs to a queue.
    2. Read and process the next N frames on the GPUs.
  • The queue is consumed by a multithreaded process running on the CPU.
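The batch cycle described above can be sketched with Python's standard `threading` and `queue` modules. Here `predict` is a hypothetical stand-in for a per-GPU MXNet model, and the batching/consumer logic is a minimal sketch of the idea, not the actual pipeline:

```python
import queue
import threading

def predict(frame):
    # Hypothetical stand-in for an MXNet model running on one GPU.
    return frame * 2

def dispatch(frames, n_gpu, out_queue):
    # Process frames in batches of n_gpu, one thread per "GPU".
    for start in range(0, len(frames), n_gpu):
        batch = frames[start:start + n_gpu]
        results = [None] * len(batch)

        def run(i, frame):
            results[i] = predict(frame)

        threads = [threading.Thread(target=run, args=(i, f))
                   for i, f in enumerate(batch)]
        for t in threads: t.start()
        for t in threads: t.join()

        # Batch done: hand the outputs to the consumer, then loop to
        # the next batch.
        for r in results:
            out_queue.put(r)
    out_queue.put(None)  # sentinel: no more frames

def consume(out_queue, sink):
    # CPU-side consumer: drain the queue until the sentinel arrives.
    while True:
        item = out_queue.get()
        if item is None:
            break
        sink.append(item)

q = queue.Queue()
sink = []
consumer = threading.Thread(target=consume, args=(q, sink))
consumer.start()
dispatch(list(range(6)), n_gpu=2, out_queue=q)
consumer.join()
```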

Here is a visual description of the workflow:

(workflow diagram not shown)

The idea is to utilize the idle CPU while the GPUs are busy processing frames.

Using the threading library, I managed to read and process the first N frames, but after that the GPUs fail to process the next batch of frames.

Note that the source code below has been simplified to clarify the workflow.

Here is the code of the function that reads frames, dispatches them to the GPUs, and then sends the outputs to the CPU queue:

def dispatch_jobs(video_capture, detection_workers, number_of_gpu, cpu_queue):
    # detection_workers is a list of N similar MXNet models, each one works on a different GPU
    is_last_frame = False
    while not is_last_frame:
        frames_batch = []
        for i in range(0, number_of_gpu):
            success, frame = read_frame_from_video(video_capture)
            if not success:
                logging.warning("Can't receive frame. Exiting.")
                is_last_frame = True
                break
            frames_batch.append(frame)

        workers = []
        for detection_worker_id in range(0, len(frames_batch)):
            frame_image = frames_batch[detection_worker_id]
            thread = Thread(target=detection_workers[detection_worker_id].predict, kwargs={'image': frame_image})
            workers.append(thread)

        for w in workers: w.start()
        for w in workers: w.join()

        # sending to the CPU queue
        for detection_worker_id in range(0, len(frames_batch)):
            detector_output = detection_workers[detection_worker_id].output
            cpu_queue.put(detector_output)

    logging.info("While loop is broken... putting -1 in the queue")
    cpu_queue.put(-1)

    return

As mentioned above, a consumer thread reads the outputs from cpu_queue and passes them to a multithreaded function (on the CPU). Here is the code of the consumer:

def consume_cpu_queue(cpu_queue, number_of_process):
    while cpu_queue.empty():
        logging.info("Sleeping 1 second")
        time.sleep(1)

    prediction_output = cpu_queue.get()
    if prediction_output == -1:
        return

    process_output_multithread(prediction_output, number_of_process)
    consume_cpu_queue(cpu_queue, number_of_process)

def process_output_multithread(pred_output, number_of_process):
    workers = []
    for i in range(0, number_of_process):
        thread = Thread(target=process, kwargs={'pred_output': pred_output})
        workers.append(thread)

    for w in workers: w.start()
    for w in workers: w.join()
    return

# Here is how the consumer thread is initiated
cpu_consumer_thread = Thread(target=consume_cpu_queue, args=(cpu_queue, number_of_process))

# Here is how I run the application
cpu_consumer_thread.start()
dispatch_jobs(video_capture, detection_workers, number_of_gpu, cpu_queue)
cpu_consumer_thread.join()
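As a side note on the consumer above, the one-second polling loop is not strictly necessary: `queue.Queue.get` blocks until an item arrives, and an iterative loop avoids the recursion in `consume_cpu_queue`. A minimal sketch using the same `-1` sentinel as the code above:

```python
import queue
import threading

def consume_blocking(cpu_queue, results):
    # Queue.get blocks until an item is available, so no sleep/poll loop
    # is needed; loop until the -1 sentinel arrives.
    while True:
        item = cpu_queue.get()
        if item == -1:
            break
        results.append(item)

q = queue.Queue()
out = []
t = threading.Thread(target=consume_blocking, args=(q, out))
t.start()
for x in [10, 20, 30]:
    q.put(x)
q.put(-1)  # signal end of stream
t.join()
```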

I have already checked this question, but I am not sure whether Numba can solve my problem.

Any suggestions or pointers would be very helpful.

0 Answers:

There are no answers.