Queue.join() does not unblock

Time: 2015-06-12 14:13:48

Tags: python multithreading queue

I'm trying to write a Python script that crawls a website in parallel. I built a prototype that lets me crawl to a given depth.

However, join() doesn't seem to work properly, and I can't figure out why.

Here is my code:

from threading import Thread
import Queue
import urllib2
import re
from BeautifulSoup import *
from urlparse import urljoin


def doWork():
    while True:
        try:
            myUrl = q_start.get(False)
        except:
            continue
        try:
            c=urllib2.urlopen(myUrl)
        except:
            continue
        soup = BeautifulSoup(c.read())
        links = soup('a')
        for link in links:
            if('href' in dict(link.attrs)):
                url = urljoin(myUrl,link['href'])
                if url.find("'")!=-1: continue
                url=url.split('#')[0]
                if url[0:4] == 'http':
                    print url
                    q_new.put(url)


q_start = Queue.Queue()
q_new = Queue.Queue()

for i in range(20):
        t = Thread(target=doWork)
        t.daemon = True
        t.start()


q_start.put("http://google.com")
print "loading"
q_start.join()
print "end"

1 Answer:

Answer (score: 3)

join() will block until task_done() has been called as many times as items have been enqueued

You never call task_done(), so join() blocks forever. In the code you posted, the right place to call it is at the end of the doWork loop:

def doWork():
    while True:
        task = q_start.get(False)
        ...
        for subtask in processed(task):
            ...
        q_start.task_done()  # tell the queue this task is finished, so join() can unblock
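
For completeness, below is a minimal sketch of the question's script with that change applied (the BeautifulSoup parsing body is elided as a comment). Going slightly beyond the answer's one-line fix, it narrows the bare except clauses to Queue.Empty and urllib2.URLError and calls task_done() from a finally block, so that a URL that fails in urlopen() is still marked done; otherwise join() would keep waiting on that item:

from threading import Thread
import Queue
import urllib2


def doWork():
    while True:
        try:
            myUrl = q_start.get(False)  # non-blocking; raises Queue.Empty when the queue is empty
        except Queue.Empty:
            continue  # nothing was dequeued, so there is no task to mark done
        try:
            try:
                c = urllib2.urlopen(myUrl)
            except urllib2.URLError:
                continue  # the finally clause below still marks this URL as done
            # ... parse c.read() with BeautifulSoup and q_new.put() the extracted links, as in the question ...
        finally:
            q_start.task_done()  # pair every successful get() with exactly one task_done()


q_start = Queue.Queue()
q_new = Queue.Queue()

for i in range(20):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

q_start.put("http://google.com")
print "loading"
q_start.join()  # unblocks once every URL put on q_start has been marked done
print "end"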