Python 3 multiprocessing Queue and multiple threads not completing properly with join()?

Asked: 2018-11-02 05:03:45

Tags: python python-3.6

I'm trying to use multiprocessing.Queue and threading.Thread to split up a large number of tasks (health checks on monitoring cameras). Given the code below, I want to know when all of the cameras (there are 32,000+) have been checked, but my output never seems to reach the final log statement in main.

Each queue_worker calls process_camera, which currently performs all of the health checks and returns a value (this part works!).

Watching it run, I see it get nearly "complete" and then hang, so something is blocking or preventing it from finishing... I've tried giving the get() and join() calls timeout arguments, but that seems to have no effect at all!

I've been staring at this code and the docs for 3 days now... is there something obvious I'm not seeing?

The end goal is to check all 30,000+ cameras (loaded into all_cameras at script startup), then "loop" and keep doing so until the user aborts the script.

import logging
import queue
import threading
import multiprocessing

def queue_worker(camera_q, result_q):
    '''
    Function takes camera off the queue and calls healthchecks
    '''

    try:
        camera = camera_q.get()
        camera_status, remove_camera = process_camera(camera)

        result_q.put("Success")
        return True
    except queue.Empty:
        logging.info("Queue is empty")
        result_q.put("Fail")
        return False


def process_worker(camera_q, result_q, process_num, stop_event):
    while not stop_event.is_set():
        # Create configured number of threads and provide references to both Queues to each thread
        threads = []
        for i in range(REQUEST_THREADS):
            thread = threading.Thread(target=queue_worker, args=(camera_q, result_q))
            thread.setName("CameraThread-{}".format(i))
            threads.append(thread)
            thread.start()

        for thread in threads:
            thread.join(timeout=120)

        if camera_q.empty():
            num_active = sum([t.is_alive() for t in threads])
            logging.info("[Process {}] << {} >> active threads and << {} >> cameras left to process. << {} >> processed.".format(process_num, num_active, camera_q.qsize(), result_q.qsize()))


def main():
    '''
    Main application entry
    '''

    logging.info("Starting Scan With << " + str(REQUEST_THREADS) + " Threads and " + str(CHILD_PROCESSES) + " Processors >>")
    logging.info("Reference Images Stored During Scan << " + str(store_images) + " >>")

    stop_event = multiprocessing.Event()
    camera_q, result_q = multiprocessing.Queue(), multiprocessing.Queue()

    # Create a Status thread for maintaining process status
    create_status_thread()

    all_cameras = get_oversite_cameras(True)
    for camera in all_cameras:
        camera_q.put(camera)

    logging.info("<< {} >> cameras queued up".format(camera_q.qsize()))

    processes = []
    process_num = 0
    finished_processes = 0
    for i in range(CHILD_PROCESSES):
        process_num += 1
        proc = multiprocessing.Process(target=process_worker, args=(camera_q, result_q, process_num, stop_event))
        proc.start()
        processes.append(proc)

    for proc in processes:
        proc.join()
        finished_processes += 1
        logging.info("{} finished processes".format(finished_processes))

    logging.info("All processes finished")

EDIT: Not sure if this helps (visually), but here's a sample of the current output when testing with a set of 2000 cameras:

[2018-11-01 23:47:41,854] INFO - MainThread - root - Starting Scan With << 100 Threads and 16 Processors >>
[2018-11-01 23:47:41,854] INFO - MainThread - root - Reference Images Stored During Scan << False >>
[2018-11-01 23:47:41,977] INFO - MainThread - root - << 2000 >> cameras queued up
[2018-11-01 23:47:54,865] INFO - MainThread - root - [Process 3] << 0 >> active threads and << 0 >> cameras left to process. << 1570 >> processed.
[2018-11-01 23:47:56,009] INFO - MainThread - root - [Process 11] << 0 >> active threads and << 0 >> cameras left to process. << 1575 >> processed.
[2018-11-01 23:47:56,210] INFO - MainThread - root - [Process 14] << 0 >> active threads and << 0 >> cameras left to process. << 1579 >> processed.
[2018-11-01 23:47:56,345] INFO - MainThread - root - [Process 9] << 0 >> active threads and << 0 >> cameras left to process. << 1580 >> processed.
[2018-11-01 23:47:59,118] INFO - MainThread - root - [Process 2] << 0 >> active threads and << 0 >> cameras left to process. << 1931 >> processed.
[2018-11-01 23:47:59,637] INFO - MainThread - root - [Process 15] << 0 >> active threads and << 0 >> cameras left to process. << 1942 >> processed.
[2018-11-01 23:48:00,310] INFO - MainThread - root - [Process 8] << 0 >> active threads and << 0 >> cameras left to process. << 1945 >> processed.
[2018-11-01 23:48:00,445] INFO - MainThread - root - [Process 13] << 0 >> active threads and << 0 >> cameras left to process. << 1946 >> processed.
[2018-11-01 23:48:01,391] INFO - MainThread - root - [Process 10] << 0 >> active threads and << 0 >> cameras left to process. << 1949 >> processed.
[2018-11-01 23:48:01,527] INFO - MainThread - root - [Process 5] << 0 >> active threads and << 0 >> cameras left to process. << 1950 >> processed.
[2018-11-01 23:48:01,655] INFO - MainThread - root - [Process 6] << 0 >> active threads and << 0 >> cameras left to process. << 1951 >> processed.
[2018-11-01 23:48:02,519] INFO - MainThread - root - [Process 1] << 0 >> active threads and << 0 >> cameras left to process. << 1954 >> processed.
[2018-11-01 23:48:06,915] INFO - MainThread - root - [Process 12] << 0 >> active threads and << 0 >> cameras left to process. << 1981 >> processed.
[2018-11-01 23:48:27,339] INFO - MainThread - root - [Process 16] << 0 >> active threads and << 0 >> cameras left to process. << 1988 >> processed.
[2018-11-01 23:48:28,762] INFO - MainThread - root - [Process 4] << 0 >> active threads and << 0 >> cameras left to process. << 1989 >> processed.

It "hangs" at 1989, just 11 short of 2000 - this has been so hard to debug!

1 Answer:

Answer 0 (score: 0)

It's a bit hard to answer this precisely because it isn't a complete listing. For example, the implementation of create_status_thread() is hidden. That is especially tricky when debugging a deadlock, since deadlocks are typically caused by a particular sequence of accesses to shared resources, and create_status_thread may contain one of those. Still, some suggestions:

  1. You've already sunk a lot of time into this, so it wouldn't hurt to spend a bit more building a simple example with scaffolding code. I'd suggest having it use only dummy methods instead of real cameras. If you haven't already, I'd also try testing with smaller numbers and prove it works for those first. It would make for a better StackOverflow question too ;)
  2. How much multiprocessing do you actually need? 30k cameras sounds like a lot, but if each check takes 2 ms, that's still every camera checked once per minute. Is the complexity worth it? What's your SLA?
  3. Process.join() has hanging behavior similar to what you describe when there are still unprocessed items on the input queue, which seems plausible here. You certainly have a lot of items flowing through queues in the snippet, namely camera_q and result_q. See https://docs.python.org/3.7/library/multiprocessing.html?highlight=process#programming-guidelines
  

Remember that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the "feeder" thread to the underlying pipe. (The child process can call the Queue.cancel_join_thread method of the queue to avoid this behaviour.)


This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.

The link above includes an example that blocks because join() is called before get(). It seems like that could happen in your code depending on execution order, since you call get() inside process_worker().

  4. multiprocessing.Pool may be a simpler way to manage a pool of workers. See https://docs.python.org/3.7/library/multiprocessing.html?highlight=process#using-a-pool-of-workers