Question

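I'm new to Python and to threading. I have written Python code that acts as a web crawler and searches websites for a specific keyword. My question is: how can I use threads to run three different instances of my class at the same time? When one of the instances finds the keyword, all three must shut down and stop crawling the web. Here is some code.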

class Crawler:
    def __init__(self):
        # the actual code for finding the keyword
        pass

def main():
    crawl = Crawler()

if __name__ == "__main__":
    main()

How can I use threads to have Crawler do three different crawls at the same time?

Answer 1

There doesn't seem to be a (simple) way to terminate a thread in Python.

Here is a simple example of running multiple HTTP requests in parallel:

import threading
import urllib.request

def crawl():
    # Each call runs in its own thread and fetches the page.
    data = urllib.request.urlopen("http://www.google.com/").read()

    print("Read google.com")

threads = []

for n in range(10):
    thread = threading.Thread(target=crawl)
    thread.start()

    threads.append(thread)

# wait until all of the threads have finished

print("Waiting...")

for thread in threads:
    thread.join()

print("Complete.")

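With some additional overhead, you can use a multi-process approach, which is more powerful and lets you terminate the thread-like processes.

I have extended the example to use it. I hope this helps: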

import multiprocessing
import urllib.request

def crawl(result_queue):
    data = urllib.request.urlopen("http://news.ycombinator.com/").read()

    print("Requested...")

    # Placeholder condition (always true); replace it with a real keyword test on `data`.
    if "result found (for example)":
        result_queue.put("result!")

    print("Read site.")

# The guard is needed on platforms that spawn new processes instead of forking (e.g. Windows).
if __name__ == "__main__":
    processes = []
    result_queue = multiprocessing.Queue()

    for n in range(4):  # start 4 processes crawling for the result
        process = multiprocessing.Process(target=crawl, args=[result_queue])
        process.start()
        processes.append(process)

    print("Waiting for result...")

    result = result_queue.get()  # blocks until any of the processes has `.put()` a result

    for process in processes:  # then kill them all off
        process.terminate()

    print("Got result:", result)

Answer 2

Starting a thread is easy:

import threading

thread = threading.Thread(target=function_to_call_inside_thread)  # the callable must be passed as `target`
thread.start()

Create an event object so you are notified when the work is done:

event = threading.Event()
event.wait() # call this in the main thread to wait for the event
event.set() # call this in a thread when you are ready to stop

You will need to add a stop() method to your crawlers. Once the event has fired, call it on each of them:

for crawler in crawlers:
    crawler.stop()

Then call join on the thread:

thread.join() # waits for the thread to finish
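
Putting those pieces together might look like the sketch below. It is only an illustration, not a definitive implementation: the Crawler class with its run() and stop() methods, the crawl_all() helper, the in-memory PAGES lists, and the keyword are all hypothetical stand-ins for real crawling code. Each crawler checks its own stop flag between pages, because a thread cannot be killed from outside and has to exit on its own.

```python
import threading

# Hypothetical stand-in for fetched pages: one list of page texts per crawler.
PAGES = [
    ["nothing here", "still nothing", "more text"],
    ["some text", "the keyword is here", "never reached"],
    ["foo", "bar", "baz"],
]


class Crawler:
    def __init__(self, pages, keyword, found_event, results):
        self.pages = pages
        self.keyword = keyword
        self.found_event = found_event             # shared by all crawlers
        self.results = results                     # shared list of matching pages
        self._stop_requested = threading.Event()   # set by stop()

    def stop(self):
        # Ask this crawler to give up before it looks at its next page.
        self._stop_requested.set()

    def run(self):
        for page in self.pages:
            if self._stop_requested.is_set():
                return
            if self.keyword in page:
                self.results.append(page)
                self.found_event.set()             # signal the main thread
                return


def crawl_all(page_lists, keyword):
    found = threading.Event()
    results = []
    crawlers = [Crawler(pages, keyword, found, results) for pages in page_lists]
    threads = [threading.Thread(target=c.run) for c in crawlers]
    for t in threads:
        t.start()

    # Wait for a match, but also give up once every crawler has run out of pages.
    while not found.is_set() and any(t.is_alive() for t in threads):
        found.wait(timeout=0.05)

    for c in crawlers:
        c.stop()
    for t in threads:
        t.join()
    return results


print(crawl_all(PAGES, "keyword"))
```

The polling loop is a design choice for the case where no crawler ever finds the keyword: a bare event.wait() in the main thread would then block forever.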

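If you do any amount of this kind of programming, you will want to look at the eventlet module. It lets you write "threaded" code without many of the drawbacks of threads.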

Answer 3

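First of all, if you are new to Python, I would not recommend tackling threads yet. Get used to the language first, then take on multithreading.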

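That said, if your goal is parallelism (you said "run simultaneously"), you should know that in Python (or at least in the default implementation, CPython) multiple threads will not truly run in parallel, even if multiple processor cores are available. Read up on the GIL (Global Interpreter Lock) for more information.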
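
As a side note for a crawler specifically: CPython releases the GIL while a thread is blocked waiting, for example on a network response, so the waiting time of several threads can still overlap even though their Python code never executes in parallel. The sketch below illustrates this under an assumption: time.sleep stands in for a slow network request, and fake_fetch() and timed_run() are hypothetical names made up for this example.

```python
import threading
import time


def fake_fetch(delay):
    # Stand-in for a blocking network call; the GIL is released while sleeping.
    time.sleep(delay)


def timed_run(num_threads, delay):
    # Start `num_threads` threads that each wait `delay` seconds,
    # and return the total wall-clock time until all have finished.
    start = time.perf_counter()
    threads = [
        threading.Thread(target=fake_fetch, args=(delay,))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


elapsed = timed_run(3, 0.2)
print(f"3 waits of 0.2 s took {elapsed:.2f} s in total")
```

If the waits ran one after another, the total would be at least 0.6 s. CPU-bound work, such as parsing large pages in pure Python, would not overlap this way.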

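Finally, if you still want to go ahead, check the Python documentation for the threading module. I'd say Python's documentation is as good as a reference gets, with plenty of examples and explanations.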

Answer 4

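For this problem you can use either the threading module (which, as others have said, will not do true threading because of the GIL) or the multiprocessing module (depending on which version of Python you are using). They have very similar APIs, but I recommend multiprocessing, as it is more Pythonic, and I find communicating between processes with Pipes very easy.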

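You will want a main loop that creates your processes. Each process should run your crawler and have a pipe back to the main thread. Each process should listen for a message on its pipe, do some crawling, and send a message back over the pipe if it finds something (before terminating). Your main loop should loop over each of the pipes coming back to it, listening for this "found something" message. Once it hears that message, it should resend it over the pipes to the remaining processes, then wait for them to finish.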
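
The design described above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: worker(), find_first(), the ("found", page) and ("done", None) message tuples, and the in-memory PAGES lists are hypothetical names standing in for real crawling code. Instead of a hand-written loop over the pipes, the sketch uses multiprocessing.connection.wait, which blocks until at least one pipe has something to read.

```python
import multiprocessing
from multiprocessing.connection import wait

# Hypothetical stand-in for fetched pages: one list of page texts per process.
PAGES = [
    ["nothing here", "still nothing", "more text"],
    ["some text", "the keyword is here", "never reached"],
    ["foo", "bar", "baz"],
]


def worker(conn, pages, keyword):
    message = ("done", None)
    for page in pages:
        if conn.poll():                  # the main process sent something: stop early
            break
        if keyword in page:
            message = ("found", page)
            break
    conn.send(message)                   # always report back before terminating
    conn.close()


def find_first(page_lists, keyword):
    conns, procs = [], []
    for pages in page_lists:
        parent_conn, child_conn = multiprocessing.Pipe()
        proc = multiprocessing.Process(target=worker, args=(child_conn, pages, keyword))
        proc.start()
        child_conn.close()               # the parent only needs its own end
        conns.append(parent_conn)
        procs.append(proc)

    result = None
    pending = list(conns)
    while pending and result is None:
        for conn in wait(pending):       # blocks until at least one pipe is readable
            pending.remove(conn)
            try:
                status, page = conn.recv()
            except EOFError:             # worker exited without reporting
                continue
            if status == "found":
                result = page
                break

    for conn in pending:                 # pass the news on to the remaining workers
        try:
            conn.send("stop")
        except OSError:                  # that worker has already exited
            pass
    for proc in procs:
        proc.join()
    for conn in conns:
        conn.close()
    return result


if __name__ == "__main__":
    print(find_first(PAGES, "keyword"))
```

Each worker only notices the stop message between pages, so a long fetch still has to finish first. If that delay matters, calling terminate() on the remaining processes is the blunter alternative.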

More information can be found here: http://docs.python.org/library/multiprocessing.html

Answer 5

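First of all, threading is not the solution in Python. Because of the GIL, threads do not run in parallel. So you can handle this with multiprocessing, and you will be limited by the number of processor cores.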

What is the goal of your work? Do you just want a crawler? Or do you have some academic goal (learning about threads and Python, etc.)?

Another point: a crawler uses more resources than most other programs, so what is the scale of your crawl?

Terminating multiple threads when any thread finishes its task
