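I've just started experimenting with multithreading/multiprocessing and have run into a problem. What I want to do is generate a number of requests for data that should be downloaded from a remote database. These are stored in a Queue.Queue (let's call it q_in). Once I've generated all the requests, I start a limited number of instances of a thread class, each of which takes q_in and another Queue (q_out) as input. Each thread then get()s jobs from q_in and puts the results on q_out. This part is IO-bound, so I think threads are a good choice. The results in q_out are consumed by a pool of processes, which perform some operations on them. This part is CPU-bound, so I think processes are a good choice.
This seems to work fine, except that I've run into some strange behavior, which I demonstrate below.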
import threading
import Queue
import multiprocessing as mp

class TestThread(threading.Thread):
    def __init__(self, threadnr, resultPool, jobPool):
        self.threadnr = threadnr
        self.resultPool = resultPool
        self.jobPool = jobPool
        threading.Thread.__init__(self)

    def run(self):
        while True:
            job = self.jobPool.get()
            if job != None:
                for a in range(10):
                    for i in xrange(1000000):
                        pass
                print "Thread nr %d finished job %d" % (self.threadnr, job)
                self.resultPool.put([self.threadnr, job+1])
                self.jobPool.task_done()

def test(i):
    print mp.current_process().name, "test", i
    return mp.current_process().name, "test", i

if __name__ == '__main__':
    q_in = Queue.Queue()
    q_out = Queue.Queue()
    nr_jobs = 20
    res = []
    nr_threads = 4
    threads = []
    for i in range(nr_jobs):
        q_in.put(i)
    for i in range(nr_threads):
        t = TestThread(i, q_out, q_in)
        t.start()
        threads.append(t)
    p_pool = mp.Pool(4)
    for i in range(nr_jobs):
        job = q_out.get(block=True)
        print "Got job", job
        res.append(p_pool.apply_async(test, (job,)))
    p_pool.close()
    p_pool.join()
    for r in res:
        print r.get()
    for t in threads:
        t.join()
The output of this is:
Thread nr 2 finished job 2
Got job [2, 3]
Thread nr 0 finished job 0
Got job [0, 1]
Thread nr 1 finished job 1
Got job [1, 2]
Thread nr 3 finished job 3
Got job [3, 4]
Thread nr 2 finished job 4
Got job Thread nr 0 finished job 5[
2, 5]
Got job [0, 6]
Thread nr 1 finished job 6
Got job [1, 7]
Thread nr 3 finished job 7
Got job [3, 8]
Thread nr 2 finished job 8
Got job [2, 9]
Thread nr 0 finished job 9
Got job [0, 10]
PoolWorker-4 test [1, 2]
PoolWorker-4 test [1, 7]
PoolWorker-3 test [3, 4]
PoolWorker-3 test [3, 8]
PoolWorker-2 test [0, 1]
PoolWorker-2 test [0, 6]
PoolWorker-2 test [0, 10]
PoolWorker-1 test [2, 3]
PoolWorker-1 test [2, 5]
PoolWorker-1 test [2, 9]
('PoolWorker-1', 'test', [2, 3])
('PoolWorker-2', 'test', [0, 1])
('PoolWorker-4', 'test', [1, 2])
('PoolWorker-3', 'test', [3, 4])
('PoolWorker-1', 'test', [2, 5])
('PoolWorker-2', 'test', [0, 6])
('PoolWorker-4', 'test', [1, 7])
('PoolWorker-3', 'test', [3, 8])
('PoolWorker-1', 'test', [2, 9])
('PoolWorker-2', 'test', [0, 10])
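This is a test program that works in much the same way as my real program. What I find strange is that, even though the threads take a relatively long time to finish, nothing from the processes is printed until the threads have completed all of their work. The jobs appear to be consumed continuously, but the output of the processes only shows up after all threads are done.
In this example that is fairly harmless (if annoying), but in my real program... the queuing of output seems to cause memory errors, because all the output from the processes is held back until the last thread finishes.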
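To make the memory concern concrete, here is a minimal sketch of how the hand-off between the threads and the pool could be bounded. It is an illustration under stated assumptions, not a confirmed fix for the delayed output: it assumes Python 3 (the test program above is Python 2), and `fetch`, `crunch`, `run_pipeline`, `max_pending` and the `SENTINEL` marker are hypothetical names that do not appear in the real program. The idea is that a `queue.Queue` with a `maxsize` blocks the downloader threads once the consumer falls behind, and that results are collected as soon as tasks complete rather than at the very end.

```python
import queue
import threading
from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait

SENTINEL = None  # hypothetical marker telling a downloader thread to stop


def fetch(job):
    """Hypothetical stand-in for the IO-bound download step."""
    return job + 1


def crunch(item):
    """Hypothetical stand-in for the CPU-bound processing step."""
    return item * item


def downloader(q_in, q_out):
    """Pull jobs until the sentinel arrives; put() blocks while q_out is full."""
    while True:
        job = q_in.get()
        if job is SENTINEL:
            break
        q_out.put(fetch(job))


def run_pipeline(jobs, executor, nr_threads=4, max_pending=8):
    """Download with threads, process with `executor`, keep the backlog bounded."""
    q_in = queue.Queue()
    q_out = queue.Queue(maxsize=max_pending)  # bounded, so producers get backpressure
    for job in jobs:
        q_in.put(job)
    for _ in range(nr_threads):
        q_in.put(SENTINEL)  # one stop marker per thread so join() can return

    threads = [threading.Thread(target=downloader, args=(q_in, q_out))
               for _ in range(nr_threads)]
    for t in threads:
        t.start()

    pending, results = set(), []
    for _ in range(len(jobs)):
        pending.add(executor.submit(crunch, q_out.get()))
        if len(pending) >= max_pending:
            # Too many tasks in flight: wait for at least one and collect it.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            results.extend(f.result() for f in done)

    done, _ = wait(pending)  # collect whatever is still outstanding
    results.extend(f.result() for f in done)
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(run_pipeline(list(range(20)), pool))
```

Because `run_pipeline` takes the executor as a parameter, the same code can be tried with a `ThreadPoolExecutor` first to check the queue logic before switching to real processes.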
As a side question: is mixing threads and processes a good idea, or should I stick to one or the other?
I'd appreciate any thoughts on this.