Question

我必须向大量网站发出大量（数千）HTTP GET请求。由于某些网站可能无法响应（或需要很长时间才能做出响应）而其他网站超时，因此这非常缓慢。因为我需要尽可能多的响应，设置一个小的超时（3-5秒）对我不利。

我还没有在Python中进行任何类型的多处理或多线程，我已经阅读了很长一段时间的文档。这是我到目前为止所做的：

import requests
from bs4 import BeautifulSoup
from multiprocessing import Process, Pool

errors = 0

def get_site_content(site):
    try :
        # start = time.time()
        response = requests.get(site, allow_redirects = True, timeout=5)
        response.raise_for_status()
        content = response.text
    except Exception as e:
        global errors
        errors += 1
        return ''
    soup = BeautifulSoup(content)
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()

    return text

sites = ["http://www.example.net", ...]

pool = Pool(processes=5)
results = pool.map(get_site_content, sites)
print results

现在，我希望返回的结果以某种方式加入。这允许两种变化：

每个进程都有一个本地列表/队列，其中包含已累积的内容与其他队列连接以形成单个结果，包含所有站点的所有内容。
每个进程在进行时写入单个全局队列。这将需要一些锁定机制进行并发检查。

多处理或多线程会是更好的选择吗？我将如何使用Python中的任何一种方法完成上述操作？

修改：

我确实尝试过以下内容：

# global
queue = []
with Pool(processes = 5) as pool:
    queue.append(pool.map(get_site_contents, sites))

print queue

然而，这给了我以下错误：

with Pool(processes = 4) as pool:
AttributeError: __exit__

我不太明白。我在理解究竟是什么时遇到了一些麻烦。在过去的第二个参数中，将函数应用于每个对象。它会返回什么吗？如果没有，我是否从函数中附加到全局队列？

Answer 1

pool.map开始＆＃39; n＆＃39;获取函数并使用iterable中的项运行它的进程数。当这样的进程完成并返回时，返回的值将存储在与输入变量中的输入项相同位置的结果列表中。

例如：如果编写函数来计算数字的平方，则使用pool.map在数字列表上运行此函数。 def square_this（x）： square = x ** 2 返回广场

input_iterable = [2, 3, 4]
pool = Pool(processes=2) # Initalize a pool of 2 processes
result = pool.map(square_this, input_iterable) # Use the pool to run the function on the items in the iterable
pool.close() # this means that no more tasks will be added to the pool
pool.join() # this blocks the program till function is run on all the items
# print the result
print result

...>>[4, 9, 16]

Pool.map技术在您的情况下可能并不理想，因为它会阻塞直到所有进程完成。即如果某个网站没有响应或需要太长时间才能响应您的程序，那么将无法等待它。而是尝试对您自己的类中的multiprocessing.Process进行子类化，该类会轮询这些网站并使用队列来访问结果。当您有足够数量的回复时，您可以停止所有过程，而无需等待剩余的请求完成。

Answer 2

我在大学里有一个类似的任务（实现一个多进程Web爬虫），并使用python多处理库中的多处理安全Queue类，它将通过锁和并发检查完成所有魔术。文档中的示例说明：

import multiprocessing as mp

def foo(q):
    q.put('hello')

if __name__ == '__main__':
    mp.set_start_method('spawn')
    q = mp.Queue()
    p = mp.Process(target=foo, args=(q,))
    p.start()
    print(q.get())
    p.join()

但是，我必须为此工作编写一个单独的流程类，因为我希望它能够工作。我还没有使用过程池。相反，我尝试检查内存使用情况并生成一个进程，直到达到预设的内存阈值。

多处理HTTP在Python中获取请求

2 个答案: