Question

I am writing a very basic multi-threaded web crawler in Python, using a while loop in the function that fetches pages and extracts URLs, as shown below:

def crawl():
    while True:
        try:
            p = Page(pool.get(True, 10))
        except Queue.Empty:
            continue

        # then extract urls from a page and put new urls into the queue

(The full source code is in another question: Multi-threaded Python Web Crawler Got Stuck)

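Ideally, I would like to add a condition to the while loop so that it exits when:

the pool (the Queue object that stores the URLs) is empty, and;
all threads are blocking, waiting to get a URL from the queue (which means no thread is putting new URLs into the pool, so keeping them waiting is pointless and makes my program hang).

For example: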

#thread-1.attr == 1 means the thread-1 is blocking. 0 means not blocking

while not (pool.empty() and (thread-1.attr == 1 and thread-2.attr == 1 and ...)):
    #do the crawl stuff

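So I would like to know whether one thread can check what the other active threads are doing, or read the state or value of an attribute of the other active threads.

I have read the official documentation on threading.Event(), but I still can't figure it out.

I hope someone can point me in the right direction :)

Thank you very much!

Marcus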

Answer 1

You can try to implement what you want from scratch; a few different solutions come to mind right now:

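Use threading.enumerate() to check whether any threads are still alive.
Try implementing a thread pool that lets you know which threads in the pool are still active; this also helps limit the number of threads crawling a third-party site (see the example here).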
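
As a rough illustration of the first option, here is a minimal sketch in Python 3 syntax (the question's code uses the Python 2 Queue module). The worker function, the "crawler-" name prefix, and the alive_workers helper are hypothetical names made up for this example. Note that threading.enumerate() only reports which threads are alive, not whether a live thread is currently blocked on the queue.

```python
import threading

def worker(stop_event):
    # Hypothetical worker: it just idles until it is told to stop.
    stop_event.wait()

def alive_workers(prefix="crawler-"):
    """Return the live threads whose name starts with the given prefix."""
    return [t for t in threading.enumerate() if t.name.startswith(prefix)]

stop = threading.Event()
threads = [
    threading.Thread(target=worker, args=(stop,), name="crawler-%d" % i)
    for i in range(3)
]
for t in threads:
    t.start()

count_running = len(alive_workers())   # all three workers are still waiting

stop.set()                             # release the workers
for t in threads:
    t.join()

count_after = len(alive_workers())     # finished threads are no longer listed
```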

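If you don't want to reinvent the wheel, you can use an existing library that implements a thread pool, or you can look at gevent, which uses green threads and also provides a thread pool. I have implemented something similar with code like this: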

while 1:
    try:
        url = queue.get_nowait()
    except Empty:
        # Check that all threads are done.
        if pool.free_count() == pool.size:
            break
    ...
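
The loop above relies on the gevent pool reporting how many workers are free. With plain standard-library threads, a comparable check can probably be built on Queue.join() and task_done() instead, which is a different technique from the gevent one. The sketch below is in Python 3 syntax and makes several assumptions: LINKS is a made-up link graph standing in for real pages, and the names seen, seen_lock, and done are invented for this example. The main thread waits until no URL is queued or in flight, then sets a threading.Event so the idle workers leave their loops.

```python
import queue
import threading

# Hypothetical link graph standing in for real pages (an assumption).
LINKS = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}

pool = queue.Queue()          # URLs waiting to be crawled
seen = {"a"}                  # URLs already queued, guarded by seen_lock
seen_lock = threading.Lock()
done = threading.Event()      # set once no work is queued or in flight

def crawl():
    while not done.is_set():
        try:
            url = pool.get(True, 0.1)
        except queue.Empty:
            continue          # nothing to do right now; re-check the event
        try:
            for link in LINKS.get(url, []):
                with seen_lock:
                    is_new = link not in seen
                    seen.add(link)
                if is_new:
                    # Queue children before task_done() so join() cannot return early.
                    pool.put(link)
        finally:
            pool.task_done()

pool.put("a")
workers = [threading.Thread(target=crawl) for _ in range(3)]
for w in workers:
    w.start()

pool.join()   # returns once every queued URL has been marked done
done.set()    # tell the idle workers to leave their loops
for w in workers:
    w.join()
```

Because each worker queues the new URLs before calling task_done() on the page they came from, the unfinished-task count cannot drop to zero while a page is still being processed.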

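You can also put a sentinel object into the queue to mark the end of the crawl, exit the main loop when it arrives, and then wait for the threads to finish (for example using the pool).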

while 1:
    try:
        url = queue.get_nowait()
        # StopIteration mark that no url will be added to the queue anymore.
        if url is StopIteration:
             break
    except Empty:
        continue
    ...
pool.join()
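
For comparison, the same sentinel idea might look like this with standard-library threads only. This is a sketch in Python 3 syntax under stated assumptions: the example.com URLs, the STOP object, and the fetched list are placeholders invented here, and the producer is assumed to know when no more URLs will be added. One sentinel is queued per worker so that every thread sees its own stop marker.

```python
import queue
import threading

STOP = object()               # sentinel; the loop above uses StopIteration instead
tasks = queue.Queue()
fetched = []                  # URLs handled so far, guarded by fetched_lock
fetched_lock = threading.Lock()

def worker():
    while True:
        url = tasks.get()     # blocks until a URL or a sentinel arrives
        if url is STOP:
            return
        with fetched_lock:
            fetched.append(url)   # stand-in for fetching and parsing the page

urls = ["http://example.com/%d" % i for i in range(5)]
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()

for url in urls:
    tasks.put(url)
for _ in threads:
    tasks.put(STOP)           # one sentinel per worker, queued after all URLs

for t in threads:
    t.join()
```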

Pick whichever one you prefer; I hope this helps.

Answer 2

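Consider looking at this solution: Web crawler Using Twisted. As the answer to that question says, I would also recommend that you look at http://scrapy.org/

Multithreading in Python (using threads directly) is nasty, so I would avoid it and use some kind of message passing or reactor-based programming instead.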
