Python multiprocessing and handling exceptions in workers

Time: 2013-06-05 15:06:12

Tags: python exception error-handling parallel-processing multiprocessing

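I am using Python's multiprocessing library for an algorithm in which many workers process some data and return the results to the parent process. I use one multiprocessing.Queue to pass jobs to the workers and a second one to collect the results.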

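This all works well until a worker fails to process some chunk of data. In the simplified example below, each worker has two phases:

  • Initialization: this can fail, in which case the worker should be destroyed.
  • Data processing: processing a chunk of data can fail, in which case the worker should skip that chunk and continue with the next one.

When either of these phases fails, I get a deadlock after the script completes. This code simulates my problem: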

import multiprocessing as mp
import random

workers_count = 5
# Probability of failure, change to simulate failures
fail_init_p = 0.2
fail_job_p = 0.3


#========= Worker =========
def do_work(job_state, arg):
    if random.random() < fail_job_p:
        raise Exception("Job failed")
    return "job %d processed %d" % (job_state, arg)

def init(args):
    if random.random() < fail_init_p:
        raise Exception("Worker init failed")
    return args

def worker_function(args, jobs_queue, result_queue):
    # INIT
    # What to do when init() fails?
    try:
        state = init(args)
    except:
        print "!Worker %d init fail" % args
        return
    # DO WORK
    # Process data in the jobs queue
    for job in iter(jobs_queue.get, None):
        try:
            # Can throw an exception!
            result = do_work(state, job)
            result_queue.put(result)
        except:
            print "!Job %d failed, skip..." % job
        finally:
            jobs_queue.task_done()
    # Telling that we are done with processing stop token
    jobs_queue.task_done()



#========= Parent =========
jobs = mp.JoinableQueue()
results = mp.Queue()
for i in range(workers_count):
    mp.Process(target=worker_function, args=(i, jobs, results)).start()

# Populate jobs queue
results_to_expect = 0
for j in range(30):
    jobs.put(j)
    results_to_expect += 1

# Collecting the results
# What if some workers failed to process a job and we get
# fewer results than expected?
for r in range(results_to_expect):
    result = results.get()
    print result

#Signal all workers to finish
for i in range(workers_count):
    jobs.put(None)

#Wait for them to finish
jobs.join()

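I have two questions about this code:

  1. When init() fails, how can I detect that the worker is invalid without waiting for it to finish?
  2. When do_work() fails, how can I notify the parent process that fewer results should be expected in the result queue?

Thanks for your help!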
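On the first question, one option is to let a worker whose initialization fails exit with a non-zero status and have the parent inspect Process.exitcode, which stays None while the process is still running. The following is only a minimal sketch in Python 3 (the listing above is Python 2); worker, classify_exit and FAIL_INIT_P are hypothetical names, and the short sleeps merely stand in for real initialization and work.

```python
import multiprocessing as mp
import random
import sys
import time

FAIL_INIT_P = 0.5  # hypothetical failure probability, in the spirit of fail_init_p


def worker(worker_id):
    # Simulated init(): exit with a non-zero status so the parent can notice.
    if random.random() < FAIL_INIT_P:
        sys.exit(1)
    time.sleep(0.2)  # stand-in for real work


def classify_exit(exitcode):
    """Turn a Process.exitcode value into a label; None means still running."""
    if exitcode is None:
        return "running"
    return "ok" if exitcode == 0 else "failed"


if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(i,)) for i in range(5)]
    for p in procs:
        p.start()
    time.sleep(0.1)  # a real parent would poll inside its own loop instead
    for p in procs:
        print(p.name, classify_exit(p.exitcode))
    for p in procs:
        p.join()
```

Polling exitcode does not block, so the parent can check it between reads from the result queue rather than waiting on join().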

1 Answer:

Answer 0 (score: 11)

I changed your code slightly to make it work (see the explanation below).

import multiprocessing as mp
import random

workers_count = 5
# Probability of failure, change to simulate failures
fail_init_p = 0.5
fail_job_p = 0.4


#========= Worker =========
def do_work(job_state, arg):
    if random.random() < fail_job_p:
        raise Exception("Job failed")
    return "job %d processed %d" % (job_state, arg)

def init(args):
    if random.random() < fail_init_p:
        raise Exception("Worker init failed")
    return args

def worker_function(args, jobs_queue, result_queue):
    # INIT
    # What to do when init() fails?
    try:
        state = init(args)
    except:
        print "!Worker %d init fail" % args
        result_queue.put('init failed')
        return
    # DO WORK
    # Process data in the jobs queue
    for job in iter(jobs_queue.get, None):
        try:
            # Can throw an exception!
            result = do_work(state, job)
            result_queue.put(result)
        except:
            print "!Job %d failed, skip..." % job
            result_queue.put('job failed')


#========= Parent =========
jobs = mp.Queue()
results = mp.Queue()
for i in range(workers_count):
    mp.Process(target=worker_function, args=(i, jobs, results)).start()

# Populate jobs queue
results_to_expect = 0
for j in range(30):
    jobs.put(j)
    results_to_expect += 1

init_failures = 0
job_failures = 0
successes = 0
while job_failures + successes < 30 and init_failures < workers_count:
    result = results.get()
    init_failures += int(result == 'init failed')
    job_failures += int(result == 'job failed')
    successes += int(result != 'init failed' and result != 'job failed')
    #print init_failures, job_failures, successes

for ii in range(workers_count):
    jobs.put(None)

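My changes:

  1. jobs is now a plain Queue (instead of a JoinableQueue).
  2. Workers now report failures back through the result queue as the special strings "init failed" and "job failed".
  3. The main process keeps reading results and counting those special strings for as long as its loop condition holds: fewer than 30 jobs accounted for (successes plus job failures) and fewer than workers_count init failures.
  4. At the end, put one "stop" request (i.e. a None job) on the queue for every worker you started, regardless of how many survived. Note that not all of them will necessarily be pulled from the queue (a worker that failed to initialize never reads one).

By the way, your original code was nice and easy to work with. The random probabilities bit is pretty cool.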
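The same idea can be sketched in Python 3 with tagged tuples in place of the special strings, so that a legitimate result can never be mistaken for a failure marker. This is an illustrative variant rather than the code from the answer: the names worker, collect and the tag values are assumptions, and init() and do_work() are reduced to random failures.

```python
import multiprocessing as mp
import random

FAIL_INIT_P = 0.5  # hypothetical probabilities, mirroring fail_init_p / fail_job_p
FAIL_JOB_P = 0.4


def worker(worker_id, jobs, results):
    if random.random() < FAIL_INIT_P:        # simulated init() failure
        results.put(("init_failed", worker_id))
        return
    for job in iter(jobs.get, None):         # None is the stop token
        if random.random() < FAIL_JOB_P:     # simulated do_work() failure
            results.put(("job_failed", job))
        else:
            results.put(("ok", "job %d processed %d" % (worker_id, job)))


def collect(get, n_jobs, n_workers):
    """Read tagged messages until every job is accounted for
    or every worker has reported a failed init."""
    counts = {"ok": 0, "job_failed": 0, "init_failed": 0}
    while (counts["ok"] + counts["job_failed"] < n_jobs
           and counts["init_failed"] < n_workers):
        tag, _payload = get()
        counts[tag] += 1
    return counts


if __name__ == "__main__":
    n_workers, n_jobs = 5, 30
    jobs, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(i, jobs, results))
             for i in range(n_workers)]
    for p in procs:
        p.start()
    for j in range(n_jobs):
        jobs.put(j)
    print(collect(results.get, n_jobs, n_workers))
    for _ in range(n_workers):               # one stop token per started worker
        jobs.put(None)
    for p in procs:
        p.join()
```

collect takes the getter as a parameter, so the loop condition can be exercised with a plain iterator as well as with results.get.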