Question

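I want to fetch data (JSON files only) from multiple URLs using requests.get(). The URLs are stored in a pandas DataFrame column, and I save each response to a local JSON file.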

import json
from time import time

import requests

# pd_url (a DataFrame with a 'URL' column) and headers are defined elsewhere.
start = time()
for i, url in enumerate(pd_url['URL']):
    r_1 = requests.get(url, headers=headers).json()
    filename = './jsons1/' + str(i) + '.json'
    with open(filename, 'w') as f:
        json.dump(r_1, f)

time_taken = time() - start
print('time taken:', time_taken)

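As shown above, my current code uses a for loop to fetch the data from one URL at a time. However, it takes too long to run. Is there a way to send multiple requests at once and make this program faster?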

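Also, what factors could delay the response?
My internet connection has low latency and enough bandwidth that the above should "theoretically" complete in under 20 seconds. Even so, the code above takes 145-150 seconds on every run. My goal is to bring this under 30 seconds. Please suggest a solution.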

Answer 1

It sounds like you want multi-threading, so use ThreadPoolExecutor from the standard library. It can be found in the concurrent.futures package.

import concurrent.futures
import json

import requests

def make_request(url, headers):
    # Fetch one URL and return the decoded JSON body.
    return requests.get(url, headers=headers).json()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Map each future to the index of its URL so file names follow the input order.
    futures = {executor.submit(make_request, url, headers): idx
               for idx, url in enumerate(pd_url['URL'])}
    for future in concurrent.futures.as_completed(futures):
        idx = futures[future]
        try:
            data = future.result()
        except Exception as exc:
            print(f"Generated an exception: {exc}")
            continue  # nothing to write for a failed request
        with open(f"./jsons1/{idx}.json", 'w') as f:
            json.dump(data, f)

You can increase or decrease the number of threads, specified as max_workers, as needed.
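To get a feel for how the thread count affects total run time, here is a minimal sketch that replaces the real request with a simulated delay. The names `fake_fetch`, `timed_run`, and `SIMULATED_LATENCY_S` are hypothetical and only for illustration; the 0.05 s delay is an assumed stand-in for a server round trip, not a measurement.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SIMULATED_LATENCY_S = 0.05  # assumed stand-in for one server round trip


def fake_fetch(index):
    # Hypothetical placeholder for requests.get(); sleeping mimics waiting on the network.
    time.sleep(SIMULATED_LATENCY_S)
    return index


def timed_run(n_tasks, max_workers):
    # Run n_tasks simulated fetches and return (results, elapsed seconds).
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(fake_fetch, range(n_tasks)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # Compare a few worker counts on the same simulated workload.
    for workers in (1, 5, 10):
        _, elapsed = timed_run(20, workers)
        print(f"max_workers={workers}: {elapsed:.2f}s")
```

With real servers the gain usually flattens at some point, since the remote side or your bandwidth becomes the limit, so it is worth trying a few values rather than assuming more threads is always better.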

Answer 2

You can use multiple threads to parallelize the fetching. This article presents an approach using the ThreadPoolExecutor class from the concurrent.futures module.

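It looks like @gold_cy posted a nearly identical answer while I was working on this, but for posterity, here is my example. I adapted your code to use an executor, and since I don't have convenient access to a list of JSON URLs, I made a few small changes so that it runs locally.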

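I used a list of 100 URLs. Fetching them serially took about 125 seconds, while using 10 workers took about 27 seconds. I added a timeout to the requests so that a broken server cannot block everything, and added some code to handle error responses.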

import json
import pandas
import requests
import time

from concurrent.futures import ThreadPoolExecutor


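def fetch_url(data):
    # Each item is an (index, url) pair produced by enumerate().
    index, url = data
    print('fetching', url)
    try:
        r = requests.get(url, timeout=10)
    except requests.exceptions.Timeout:
        # Covers both connect and read timeouts; skip this URL.
        return

    if r.status_code != 200:
        # Skip error responses instead of saving them.
        return

    filename = f'./data/{index}.json'
    with open(filename, 'w') as f:
        # Write the parsed JSON; dumping r.text would store it as one quoted string.
        json.dump(r.json(), f)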


pd_url = pandas.read_csv('urls.csv')

start = time.time()
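with ThreadPoolExecutor(max_workers=10) as runner:
    # Iterate over the results so any exception raised in a worker surfaces here.
    # Leaving the with block shuts the executor down, so no explicit shutdown() is needed.
    for _ in runner.map(fetch_url, enumerate(pd_url['URL'])):
        pass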

time_taken = time.time()-start
print('time taken:', time_taken)

Also, what factors could delay the response?

The response time of the remote server will be the main bottleneck.
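As a rough back-of-the-envelope check, the timings quoted earlier in this answer can be plugged into an idealized model. This is only a sketch: `estimate_wall_time` is a hypothetical helper, and it assumes every request takes the same time and that threads add no overhead, which real servers will not honor.

```python
import math


def estimate_wall_time(n_urls, seconds_per_request, max_workers):
    # Idealized model: requests run in "waves" of max_workers, each wave
    # taking one request's duration. Ignores overhead and uneven latency.
    waves = math.ceil(n_urls / max_workers)
    return waves * seconds_per_request


# Figures from this answer: 100 URLs took about 125 s serially.
per_request = 125 / 100  # about 1.25 s per request on average
print(estimate_wall_time(100, per_request, 1))   # serial estimate
print(estimate_wall_time(100, per_request, 10))  # ideal estimate with 10 workers
```

Under these assumptions the ideal time with 10 workers is about 12.5 seconds, while the measured run took about 27 seconds. The gap presumably comes from uneven server response times and other overhead that this simple model leaves out.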

How do I send multiple "GET" requests using the requests library's get function?
