Parallel processing on GPU (MXNet) and CPU with Python

Date: 2020-05-14 08:53:50

Tags: python-3.x multithreading parallel-processing gpu mxnet

Hello, I have a data processing pipeline that I would like to optimize by running some processing threads on the CPU while MXNet prediction models run on the GPU (Python 3.6).

The idea I want to apply is the following (assuming my machine has N GPUs):

  • The GPU job dispatcher reads a sequence of N frames from the video and sends each frame to one GPU.
  • Each GPU processes its frame and predicts its content using MXNet.
  • Once all N GPUs have finished their predictions, I want to do the following simultaneously:
    1. Send the prediction outputs to a queue.
    2. Read and process the next N frames on the GPUs.
  • The queue is consumed by a multithreaded process running on the CPU.
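The batch cycle described above can be sketched with Python's standard `threading` and `queue` modules. Here `predict` is a hypothetical stand-in for a per-GPU MXNet model, and the batching/consumer logic is a minimal sketch of the idea, not the actual pipeline:

```python
import queue
import threading

def predict(frame):
    # Hypothetical stand-in for an MXNet model running on one GPU.
    return frame * 2

def dispatch(frames, n_gpu, out_queue):
    # Process frames in batches of n_gpu, one thread per "GPU".
    for start in range(0, len(frames), n_gpu):
        batch = frames[start:start + n_gpu]
        results = [None] * len(batch)

        def run(i, frame):
            results[i] = predict(frame)

        threads = [threading.Thread(target=run, args=(i, f))
                   for i, f in enumerate(batch)]
        for t in threads: t.start()
        for t in threads: t.join()

        # Batch done: hand the outputs to the consumer, then loop to
        # the next batch.
        for r in results:
            out_queue.put(r)
    out_queue.put(None)  # sentinel: no more frames

def consume(out_queue, sink):
    # CPU-side consumer: drain the queue until the sentinel arrives.
    while True:
        item = out_queue.get()
        if item is None:
            break
        sink.append(item)

q = queue.Queue()
sink = []
consumer = threading.Thread(target=consume, args=(q, sink))
consumer.start()
dispatch(list(range(6)), n_gpu=2, out_queue=q)
consumer.join()
```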

Here is a visual description of the workflow:

(workflow diagram not shown)

The idea is to utilize the idle CPU while the GPUs are busy processing frames.

Using the threading library, I managed to read and process the first N frames, but after that the GPUs fail to process the next batch of frames.

Note that the source code below has been simplified to clarify the workflow.

Here is the code of the function that reads frames, dispatches them to the GPUs, and then sends the outputs to the CPU queue:

def dispatch_jobs(video_capture, detection_workers, number_of_gpu, cpu_queue):
    # detection_workers is a list of N similar MXNet models, each one works on a different GPU
    is_last_frame = False
    while not is_last_frame:
        frames_batch = []
        for i in range(0, number_of_gpu):
            success, frame = read_frame_from_video(video_capture)
            if not success:
                logging.warning("Can't receive frame. Exiting.")
                is_last_frame = True
                break
            frames_batch.append(frame)

        workers = []
        for detection_worker_id in range(0, len(frames_batch)):
            frame_image = frames_batch[detection_worker_id]
            thread = Thread(target=detection_workers[detection_worker_id].predict, kwargs={'image': frame_image})
            workers.append(thread)

        for w in workers: w.start()
        for w in workers: w.join()

        # sending to the CPU queue
        for detection_worker_id in range(0, len(frames_batch)):
            detector_output = detection_workers[detection_worker_id].output
            cpu_queue.put(detector_output)

    logging.info("While loop is broken... putting -1 in the queue")
    cpu_queue.put(-1)

    return

As mentioned above, a consumer thread reads the outputs from cpu_queue and passes them to a multithreaded function (on the CPU). Here is the code of the consumer:

def consume_cpu_queue(cpu_queue, number_of_process):
    while cpu_queue.empty():
        logging.info("Sleeping 1 second")
        time.sleep(1)

    prediction_output = cpu_queue.get()
    if prediction_output == -1:
        return

    process_output_multithread(prediction_output, number_of_process)
    consume_cpu_queue(cpu_queue, number_of_process)

def process_output_multithread(pred_output, number_of_process):
    workers = []
    for i in range(0, number_of_process):
        thread = Thread(target=process, kwargs={'pred_output': pred_output})
        workers.append(thread)

    for w in workers: w.start()
    for w in workers: w.join()
    return

# Here is how the consumer thread is initiated
cpu_consumer_thread = Thread(target=consume_cpu_queue, args=(cpu_queue, number_of_process))

# Here is how I run the application
cpu_consumer_thread.start()
dispatch_jobs(video_capture, detection_workers, number_of_gpu, cpu_queue)
cpu_consumer_thread.join()
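As a side note on the consumer above, the one-second polling loop is not strictly necessary: `queue.Queue.get` blocks until an item arrives, and an iterative loop avoids the recursion in `consume_cpu_queue`. A minimal sketch using the same `-1` sentinel as the code above:

```python
import queue
import threading

def consume_blocking(cpu_queue, results):
    # Queue.get blocks until an item is available, so no sleep/poll loop
    # is needed; loop until the -1 sentinel arrives.
    while True:
        item = cpu_queue.get()
        if item == -1:
            break
        results.append(item)

q = queue.Queue()
out = []
t = threading.Thread(target=consume_blocking, args=(q, out))
t.start()
for x in [10, 20, 30]:
    q.put(x)
q.put(-1)  # signal end of stream
t.join()
```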

I have already checked this question, but I am not sure whether Numba can solve my problem.

Any suggestions or pointers would be very helpful.

0 Answers:

There are no answers.