Multithreaded Python web crawler gets stuck

Time: 2013-01-18 17:03:41

Tags: python multithreading python-multithreading

I am writing a Python web crawler and I want to make it multithreaded. I have finished the basic part; here is what it does:

  1. A thread fetches a URL from the queue;

  2. The thread extracts the links from the page, checks whether each link already exists in a pool (a set), and puts the new links into both the queue and the pool;

  3. The thread writes the URL and the HTTP response status to a CSV file.

But when I run the crawler, it eventually gets stuck instead of exiting normally. I have read the official Python documentation but still have no clue.

Here is the code:

    #!/usr/bin/env python
    #!coding=utf-8
    
    import requests, re, urlparse
    import threading
    from Queue import Queue
    from bs4 import BeautifulSoup
    
    #custom modules and files
    from setting import config
    
    
    class Page:
    
        def __init__(self, url):
    
            self.url = url
            self.status = ""
            self.rawdata = ""
            self.error = False
    
            r = ""
    
            try:
                r = requests.get(self.url, headers={'User-Agent': 'random spider'})
            except requests.exceptions.RequestException as e:
                self.status = e
                self.error = True
            else:
                if not r.history:
                    self.status = r.status_code
                else:
                    self.status = r.history[0]
    
            self.rawdata = r
    
        def outlinks(self):
    
            self.outlinks = []
    
            #links, contains URL, anchor text, nofollow
            raw = self.rawdata.text.lower()
            soup = BeautifulSoup(raw)
            outlinks = soup.find_all('a', href=True)
    
            for link in outlinks:
                d = {"follow":"yes"}
                d['url'] = urlparse.urljoin(self.url, link.get('href'))
                d['anchortext'] = link.text
                if link.get('rel'):
                    if "nofollow" in link.get('rel'):
                        d["follow"] = "no"
                if d not in self.outlinks:
                    self.outlinks.append(d)
    
    
    pool = Queue()
    exist = set()
    thread_num = 10
    lock = threading.Lock()
    output = open("final.csv", "a")
    
    #the domain is the start point
    domain = config["domain"]
    pool.put(domain)
    exist.add(domain)
    
    
    def crawl():
    
        while True:
    
            p = Page(pool.get())
    
            #write data to output file
            lock.acquire()
            output.write(p.url+" "+str(p.status)+"\n")
            print "%s crawls %s" % (threading.currentThread().getName(), p.url)
            lock.release()
    
            if not p.error:
                p.outlinks()
                outlinks = p.outlinks
                if urlparse.urlparse(p.url)[1] == urlparse.urlparse(domain)[1] :
                    for link in outlinks:
                        if link['url'] not in exist:
                            lock.acquire()
                            pool.put(link['url'])
                            exist.add(link['url'])
                            lock.release()
            pool.task_done()            
    
    
    for i in range(thread_num):
        t = threading.Thread(target = crawl)
        t.start()
    
    pool.join()
    output.close()
    

Any help would be greatly appreciated!

Thanks,

Marcus

1 Answer:

Answer 0 (score: 3):

Your crawl function has an infinite loop with no possible exit path. The condition True always evaluates to True, so the loop keeps running and, as you said, there is no graceful exit.

Modify the while loop in the crawl function to include an exit condition. For example, exit the while loop once the number of links saved to the CSV file exceeds some minimum count.

That is,

    def crawl():
        while len(exist) <= min_links:
            ...
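
Below is a minimal sketch of how that condition might look when folded back into the original script. It reuses the existing pool, exist, lock, output, domain and Page names; the value of min_links is a hypothetical threshold, and the timeout on pool.get (with the Queue module's Empty exception) is an extra safeguard I added so a worker does not block forever once the queue drains:

    from Queue import Empty   # the Queue module also exposes the Empty exception

    min_links = 500   # hypothetical threshold: stop once this many URLs have been seen

    def crawl():
        # exit condition instead of `while True`
        while len(exist) <= min_links:
            try:
                # time out so a worker does not block forever once the queue runs dry
                url = pool.get(timeout=5)
            except Empty:
                break

            p = Page(url)

            lock.acquire()
            output.write(p.url + " " + str(p.status) + "\n")
            print "%s crawls %s" % (threading.currentThread().getName(), p.url)
            lock.release()

            if not p.error:
                p.outlinks()
                if urlparse.urlparse(p.url)[1] == urlparse.urlparse(domain)[1]:
                    for link in p.outlinks:
                        if link['url'] not in exist:
                            lock.acquire()
                            pool.put(link['url'])
                            exist.add(link['url'])
                            lock.release()
            pool.task_done()

Note that with a hard cut-off like this, pool.join() in the main thread may never return if URLs are still queued when the workers stop, so in that variant it is usually simpler to keep the Thread objects in a list and join the threads themselves before closing the output file.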