Question

我是python中的新开发人员。我的代码如下：

import warnings
import requests
import multiprocessing

from colorama import init
init(autoreset=True)

from requests.packages.urllib3.exceptions import InsecureRequestWarning
warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter('ignore', InsecureRequestWarning)

from bs4 import BeautifulSoup as BS

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}


class Worker(multiprocessing.Process):

    def run(self):
        with open('ips.txt', 'r') as urls:
            for url in urls.readlines():
                req = url.strip()
                try:
                    page = requests.get(req, headers=headers, verify=False, allow_redirects=False, stream=True,
                                        timeout=10)
                    soup = BS(page.text)
                    # string = string.encode('ascii', 'ignore')
                    print('\033[32m' + req + ' - Title: ', soup.title)
                except requests.RequestException as e:
                    print('\033[32m' + req + ' - TimeOut!')
        return


if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = Worker()
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()

我正在尝试使程序读取IPs.txt并打印出每个网站的标题。

它可以在单个线程中完美地工作。现在，我想使用multiprocessing使其更快。

但是由于某种原因它只输出5条相同的行。我是多处理技术的新手，但尝试失败却尽了最大努力。

显示问题的屏幕截图：

我只想让5名工作人员以多线程或并行方式检查IPs.txt……我只是想使其更快。

有任何提示，线索，帮助吗？

Answer 1

问题

代码中的主要问题是，每个Worker从头开始打开ips.txt并在ips.txt中找到的每个URL上工作。因此，这五个工作人员一起打开ips.txt五次，并在每个URL上工作五次。

解决方案

解决此问题的正确方法是将代码分为 master 和 worker 。您已经实现了大多数工作程序代码。现在，让我们将主要部分（在if __name__ == '__main__':下）作为主要部分。

现在，主人应该派遣五名工人，并通过队列（multiprocessing.Queue）向他们发送工作。

multiprocessing.Queue类为多个生产者提供了一种将数据放入其中的方法，并且多个消费者可以从其中读取数据而又不会遇到竞争条件。此类实现了所有必要的锁定语义，以在多处理上下文中安全地交换数据并防止出现竞争情况。

固定代码

以下是我根据上述内容重写代码的方式：

import warnings
import requests
import multiprocessing

from colorama import init
init(autoreset=True)

from requests.packages.urllib3.exceptions import InsecureRequestWarning
warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter('ignore', InsecureRequestWarning)

from bs4 import BeautifulSoup as BS

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}


class Worker(multiprocessing.Process):

    def __init__(self, job_queue):
        super().__init__()
        self._job_queue = job_queue

    def run(self):
        while True:
            url = self._job_queue.get()
            if url is None:
                break

            req = url.strip()

            try:
                page = requests.get(req, headers=headers, verify=False, allow_redirects=False, stream=True,
                                    timeout=10)
                soup = BS(page.text)
                # string = string.encode('ascii', 'ignore')
                print('\033[32m' + req + ' - Title: ', soup.title)
            except requests.RequestException as e:
                print('\033[32m' + req + ' - TimeOut!')


if __name__ == '__main__':
    jobs = []
    job_queue = multiprocessing.Queue()

    for i in range(5):
        p = Worker(job_queue)
        jobs.append(p)
        p.start()

    # This is the master code that feeds URLs into queue.
    with open('ips.txt', 'r') as urls:
        for url in urls.readlines():
            job_queue.put(url)

    # Send None for each worker to check and quit.
    for j in jobs:
        job_queue.put(None)

    for j in jobs:
        j.join()

在上面的代码中我们可以看到，主机一次打开ips.txt，从主机中一次读取URL，并将其放入队列。每个工作程序都等待URL到达此队列。 URL到达队列后，其中一名工作人员便将其拾取并变得很忙。如果队列中有更多URL，则下一个免费工作人员会选择下一个，依此类推。

最后，我们需要某种方式让工人在完成所有工作后退出。有几种方法可以实现此目的。在此示例中，我选择了一种简单的策略，即将五个哨兵值（在这种情况下为五个None值）发送到队列中，每个工人一个，以便每个工人都可以接管并退出。

还有另一种策略，工人和主人共享multiprocessing.Event对象，就像他们现在共享{{1}}对象一样。主服务器每当希望工人退出时就调用该对象的multiprocessing.Queue方法。工作人员检查该对象set()是否退出。但是，这给代码带来了一些额外的复杂性。我在下面对此进行了讨论。

为了完整性，也为了演示最少，完整和可验证的示例，我在下面提供了两个代码示例，这些示例显示了两种停止策略。

使用前哨值停止工作人员

到目前为止，这几乎是我上面已经描述的，只是代码示例已大大简化，以消除对Python标准库之外的任何库的依赖。

在下面的示例中值得注意的另一件事是，我们没有使用worker函数，而是使用worker函数并在其中创建了is_set()。这种类型的代码通常可以在Python文档中找到，并且非常习惯。

Process

使用事件停止工作人员

使用import multiprocessing import time import random def worker(input_queue): while True: url = input_queue.get() if url is None: break print('Started working on:', url) # Random delay to simulate fake processing. time.sleep(random.randint(1, 3)) print('Stopped working on:', url) def master(): urls = [ 'https://example.com/', 'https://example.org/', 'https://example.net/', 'https://stackoverflow.com/', 'https://www.python.org/', 'https://github.com/', 'https://susam.in/', ] input_queue = multiprocessing.Queue() workers = [] # Create workers. for i in range(5): p = multiprocessing.Process(target=worker, args=(input_queue, )) workers.append(p) p.start() # Distribute work. for url in urls: input_queue.put(url) # Ask the workers to quit. for w in workers: input_queue.put(None) # Wait for workers to quit. for w in workers: w.join() print('Done') if __name__ == '__main__': master()对象来通知工人何时应该退出，这会在代码中带来一些复杂性。首先必须进行三项更改：

在主数据库中，我们在multiprocessing.Event对象上调用set()方法，以指示工人应尽快退出。
在工作程序中，我们定期调用Event对象的is_set()方法以检查是否应该退出。
在主服务器中，我们需要使用Event而不是multiprocessing.JoinableQueue，以便它可以在要求工人退出之前测试队列是否已被工人完全用尽。
在工作进程中，我们需要在队列中的每个项目都用完之后调用队列的multiprocessing.Queue方法。对于主机来说，调用队列的task_done()方法以测试是否已清空是必要的。

所有这些更改都可以在下面的代码中找到：

join()

如何在请求模块中使用多处理？

1 个答案:

问题

解决方案

固定代码

使用前哨值停止工作人员

使用事件停止工作人员