Question

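The BrokenLinkTest class in the code below does the following:

Takes a web page URL
Finds all the links in the page
Gets the headers of the links concurrently (this is done to check whether a link is broken)
Prints 'completed' once all the headers have been received.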

import threading

from bs4 import BeautifulSoup
import requests

class BrokenLinkTest(object):

    def __init__(self, url):
        self.url = url
        self.thread_count = 0
        self.lock = threading.Lock()

    def execute(self):
        soup = BeautifulSoup(requests.get(self.url).text)
        self.lock.acquire()
        for link in soup.find_all('a'):
            url = link.get('href')
            threading.Thread(target=self._check_url(url))
        self.lock.acquire()

    def _on_complete(self):
        self.thread_count -= 1
        if self.thread_count == 0: #check if all the threads are completed
            self.lock.release()
            print "completed"

    def _check_url(self, url):
        self.thread_count += 1
        print url
        result = requests.head(url)
        print result
        self._on_complete()


BrokenLinkTest("http://www.example.com").execute()

Could the concurrency/synchronization part be done better? I did it using threading.Lock. This is my first experiment with Python threading.

Answer 1

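In CPython, the global interpreter lock (GIL) allows only one thread to execute Python bytecode at a time, so you won't gain any performance this way for CPU-bound work. Also, it's not clear what is actually supposed to happen in your code: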

You never actually start a thread, you only initialize it
Nothing else happens apart from decrementing the thread count
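To make the first point concrete, here is a minimal sketch (Python 3 syntax, unlike the question's Python 2 code) showing that constructing a Thread object does not run anything until start() is called. The names work, ran and worker-1 are made up for illustration.

```python
import threading

ran = []  # records which thread executed work()

def work():
    # Hypothetical worker: just note the name of the thread it ran in.
    ran.append(threading.current_thread().name)

t = threading.Thread(target=work, name="worker-1")

# The thread object exists, but nothing has executed yet.
created_only = (t.is_alive(), t.ident, list(ran))

t.start()  # this is the call that actually launches the thread
t.join()   # wait for it to finish
after_join = (t.is_alive(), list(ran))

print(created_only)
print(after_join)
```

Before start(), the thread is not alive and has no identifier; only after start() and join() does ran contain an entry.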

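You are only likely to gain performance in a thread-based scenario if your program hands work off to IO (sending requests, writing files, and so on), because other threads can run while one is waiting.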
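As a rough illustration of that point, the sketch below uses time.sleep as a stand-in for a blocking request (an assumption; no real network call is made). fake_io and run_threaded are hypothetical names, and the code is written for Python 3.

```python
import threading
import time

def fake_io(delay):
    # Stand-in for a blocking network call. Like real socket IO,
    # time.sleep releases the GIL while it waits.
    time.sleep(delay)

def run_threaded(n, delay):
    # Start n threads that each wait for `delay` seconds, then wait for all.
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

elapsed = run_threaded(4, 0.3)
print("4 waits of 0.3s took %.2fs with threads" % elapsed)
```

Run one after another, the four waits would take about 1.2 seconds; with threads the waits overlap, so the total should be close to a single wait.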

Answer 2

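def execute(self):
    soup = BeautifulSoup(requests.get(self.url).text)
    threads = []
    for link in soup.find_all('a'):
        url = link.get('href')
        # Pass the bound method itself plus its arguments; do not call it here.
        t = threading.Thread(target=self._check_url, args=(url,))
        t.start()  # actually launch the thread
        threads.append(t)
    # Block until every worker thread has finished.
    for thread in threads:
        thread.join()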

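You can use the join method to wait for all the threads to finish.

Note that I also added a start call and passed the bound method object to the target parameter. In your original example, you were calling _check_url in the main thread and passing its return value to the target parameter.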
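The difference between the two ways of passing target can be seen in a small sketch. This is Python 3, and record, calls and the thread name worker are hypothetical names used only for illustration.

```python
import threading

main_name = threading.current_thread().name
calls = []  # (tag, name of the thread that ran record)

def record(tag):
    calls.append((tag, threading.current_thread().name))

# Mistake: record("eager") is called right here in the current thread,
# and its return value (None) is what gets passed as target.
t_wrong = threading.Thread(target=record("eager"))
t_wrong.start()  # starts a thread whose target is None, so it does nothing
t_wrong.join()

# Correct: pass the callable and its arguments separately.
t_right = threading.Thread(target=record, args=("deferred",), name="worker")
t_right.start()
t_right.join()

print(calls)
```

The "eager" call is recorded under the name of the thread that created the Thread object, while the "deferred" call is recorded under worker.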

Synchronizing multiple threads in Python
