Question

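I am writing a web crawler in Python that collects all absolute URLs on a web page and then runs the crawler again on each of those URLs.

For example:

The page http://www.enlightedinc.com/ contains a number of URLs. Once a thread has finished crawling http://www.enlightedinc.com/, it goes on to run the crawler on http://www.enlightedinc.com/support/, then on the next URL, and so on.

To make this work I used Queue, Set, and concurrent.futures.ThreadPoolExecutor.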

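1. Queue, which can be shared by multiple threads, so that every URL gets worked on.
2. Set, to make sure each URL is visited only once.
3. concurrent.futures.ThreadPoolExecutor, to run multiple threads.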
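The "visit each URL only once" step in item 2 can be sketched as a check-and-add under a lock. This is a minimal illustration, not the code from the question: the names visited, visited_lock, and mark_visited are hypothetical.

```python
import threading

# Hypothetical names for illustration; the question uses `set` and `lock`.
visited = set()
visited_lock = threading.Lock()

def mark_visited(url):
    """Return True if url was newly recorded, False if it was already seen."""
    # The membership test and the add must happen under the same lock,
    # otherwise two threads could both decide the URL is new.
    with visited_lock:
        if url in visited:
            return False
        visited.add(url)
        return True

first = mark_visited("http://www.enlightedinc.com/")
second = mark_visited("http://www.enlightedinc.com/")
```

Using `with visited_lock:` releases the lock on every exit path, which avoids pairing acquire() and release() by hand.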

Here is the code.

#python web_crawler.py http://www.enlightedinc.com/
from urllib.request import Request, urlopen
import argparse
import queue
import threading
import concurrent.futures
from bs4 import BeautifulSoup
# set up queue
url_queue = queue.Queue()
lock = threading.Lock()
# created a set to store all visited links
set = set()
total_thread = 2

def web_crawler():
    while True:
        try:
            base_url = url_queue.get(True, 10).strip()
            try:
                # add new url to set and skip visited url
                lock.acquire()
                if base_url in set:
                    continue
                else:
                    set.add(base_url)
            finally:
                lock.release()
            try:
                # get all urls and put in queue
                str_url = base_url
                req = Request(base_url, headers={'User-Agent': 'Mozilla/5.0'})
                html = urlopen(req, timeout = 5).read()
                bs = BeautifulSoup(html, "html.parser")
                possible_links = bs.find_all('a')
                for link in possible_links:
                    if link.has_attr('href') and ("http" in link.attrs['href'] or "https" in link.attrs['href']) and ".pdf" not in link.attrs['href'] and ".png" not in link.attrs['href'] and ".jpg" not in link.attrs['href'] and ".jpeg" not in link.attrs['href']:
                        str_url += "\n\t" + (link.attrs['href'])
                        url_queue.put((link.attrs['href']))
                lock.acquire()
                print (str_url)
                lock.release()
            except urllib.error.HTTPError as e:
                print(e.code)
            except Exception as e:
                print ("Error: " + str(e))
        except queue.Empty:
            print("empty occure")
            break
        finally:
            url_queue.task_done()


# parse the argument
def parse_args():
    parser = argparse.ArgumentParser(description='Script for Web Crawler')
    parser.add_argument('url',help='Starting URL')
    return parser.parse_args()

def main():
    args = parse_args()
    url_queue.put(args.url)
    # Created threadPool to work with multiple threads
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for i in range(total_thread):
            executor.submit(web_crawler)
    print ("size is ")
    print (url_queue.qsize())
    url_queue.join()

if __name__ == "__main__":
    main()

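In the main method I call url_queue.join() to make sure the threads have processed every URL in the queue. But when I check the queue size, the queue still has data, so the main thread waits forever for all URLs to be processed.

The problem is that the threads stop working at some point. I use a break statement when the queue is empty. Once all threads have stopped, I check the queue size in the main method, and it is not zero. So why do the threads leave the loop?

I have tried many things to understand this behavior but have not found a solution.

Any help or hints would be appreciated. Thanks in advance.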

Answer 1

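I don't think the queue size changes. As this answer says, queue.task_done() does not pop items from the queue, i.e. the queue size only ever increases over the run of your script. All links will be processed, but the final print shows the number of links visited, not the number of unprocessed links.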
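To check how these calls affect the reported size, here is a minimal single-threaded sketch with placeholder URLs. In the standard library's queue.Queue, get() is the call that removes an item, so qsize() drops after each get(). task_done() leaves qsize() alone and only decrements the counter that join() waits on.

```python
import queue

q = queue.Queue()
for url in ("http://a.example/", "http://b.example/", "http://c.example/"):
    q.put(url)  # placeholder URLs, not taken from the question

size_before = q.qsize()       # number of items waiting in the queue
item = q.get()                # removes one item from the queue
size_after_get = q.qsize()
q.task_done()                 # marks that item as processed; removes nothing
size_after_done = q.qsize()

print(size_before, size_after_get, size_after_done)
```

Under this reading, a non-zero qsize() after the workers have exited means some URLs were put on the queue but never fetched with get().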

concurrent.futures.ThreadPoolExecutor is not working as expected. The threads stop working even though the queue still has plenty of data to process.
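One possible cause, offered as a reading of the posted code rather than a confirmed diagnosis: the script imports only Request and urlopen from urllib.request, so the name urllib is never bound. When any fetch fails, evaluating `except urllib.error.HTTPError` would then raise NameError. executor.submit() stores an exception on the returned Future instead of printing it, so the worker would end silently. The sketch below reproduces that in isolation. Here worker is a hypothetical stand-in for web_crawler, and the ValueError stands in for a failed request.

```python
import concurrent.futures

def worker():
    try:
        # Stand-in for a failed urlopen() call.
        raise ValueError("fetch failed")
    except urllib.error.HTTPError:
        # `urllib` is deliberately not imported here, mirroring the question:
        # evaluating this except clause raises NameError.
        pass

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(worker)
# Leaving the `with` block waits for the submitted call to finish.

# Nothing has been printed so far; the error is only visible on the Future.
error = future.exception()
print(type(error).__name__, error)
```

Adding `import urllib.error` and inspecting future.result() or future.exception() for each submitted call would surface such failures. Because leaving the with block already waits for the workers, a later url_queue.join() can only block if URLs were left unprocessed.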
