Running multiple functions that make HTTP requests in parallel

Time: 2018-01-28 23:21:51

Tags: python multithreading python-requests

I'm writing a script that automatically scrapes historical data from several websites and saves it to the same Excel file, one file per past date in a given date range. Each individual function accesses several webpages on a different website, formats the data, and writes it to the file on a separate sheet. Because I'm repeatedly making requests to these sites, I make sure to add ample sleep time between requests. Instead of running these functions one after another, is there a way to run them together?

I'd like Function 1 to make one request, then Function 2 to make one request, and so on until every function has made a request. After all the functions have made one request, I'd like it to loop back and complete the second request within each function (and so on) until all requests for the given date are complete. Doing it this way would keep the same sleep time between requests on any one site while greatly reducing the overall runtime. One thing to note is that each function makes a slightly different number of HTTP requests. For example, Function 1 might make 10 requests for a given date while Function 2 makes 8, Function 3 makes 8, Function 4 makes 7, and Function 5 makes 10.

I've read up on this topic, including multithreading, but I'm not sure how to apply it to my particular scenario. If there's no way to do this, I could run each function as its own script and run them all at the same time, but then I'd have to concatenate five different Excel files for each date, which is why I'm trying to do it this way.

import random
import time

import pandas as pd

start_date = 'YYYY-MM-DD'
end_date = 'YYYY-MM-DD'
idx = pd.date_range(start_date,end_date)
date_range = [d.strftime('%Y-%m-%d') for d in idx]
max_retries_min_sleeptime = 300
max_retries_max_sleeptime = 600
min_sleeptime = 150
max_sleeptime = 250
for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    Function1()
    Function2()
    Function3()
    Function4()
    Function5()
    writer.save()
    print('Date Complete: ' + date)
    time.sleep(random.randrange(min_sleeptime,max_sleeptime,1))
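For reference, the interleaving described above (one request per site, then loop back until every site is done) can be sketched as a round-robin over per-site generators. The generator names and counts below are hypothetical stand-ins for the real request logic, not the asker's actual functions:

```python
def site_requests(name, count):
    """Hypothetical stand-in: yields once per HTTP request the site needs."""
    for i in range(count):
        # a real implementation would fetch and format a page here
        yield '{} request {}'.format(name, i + 1)

# Each site needs a different number of requests, as in the question
generators = [site_requests('Site1', 3), site_requests('Site2', 2)]

results = []
while generators:
    still_active = []
    for gen in generators:
        try:
            results.append(next(gen))
            still_active.append(gen)
        except StopIteration:
            pass  # this site has finished all its requests
    generators = still_active
    # a real version would sleep here, once per round
```

Sites that run out of requests simply drop out of the rotation, which handles the "10, 8, 8, 7, 10" imbalance naturally.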

3 Answers:

Answer 0 (score: 1)

Using Python 3.6

Here's a minimal example of concurrent requests with aiohttp (docs) to get you started. This example runs 3 downloaders at the same time and appends each rsp to responses. I believe you'll be able to adapt the idea.

import asyncio

from aiohttp.client import ClientSession


async def downloader(session, iter_url, responses):
    while True:
        try:
            url = next(iter_url)
        except StopIteration:
            return
        rsp = await session.get(url)
        if not rsp.status == 200:
            continue  # < - Or raise error
        responses.append(rsp)


async def run(urls, responses):
    async with ClientSession() as session:
        iter_url = iter(urls)
        await asyncio.gather(*[downloader(session, iter_url, responses) for _ in range(3)])


urls = [
    'https://stackoverflow.com/questions/tagged/python',
    'https://aiohttp.readthedocs.io/en/stable/',
    'https://docs.python.org/3/library/asyncio.html'
]

responses = []

loop = asyncio.get_event_loop()
loop.run_until_complete(run(urls, responses))

Result:

>>> responses
[<ClientResponse(https://docs.python.org/3/library/asyncio.html) [200 OK]>
<CIMultiDictProxy('Server': 'nginx', 'Content-Type': 'text/html', 'Last-Modified': 'Sun, 28 Jan 2018 05:08:54 GMT', 'ETag': '"5a6d5ae6-6eae"', 'X-Clacks-Overhead': 'GNU Terry Pratchett', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains; preload', 'Via': '1.1 varnish', 'Fastly-Debug-Digest': '79eb68156ce083411371cd4dbd0cb190201edfeb12e5d1a8a1e273cc2c8d0e41', 'Content-Length': '28334', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '66775', 'Connection': 'keep-alive', 'X-Served-By': 'cache-iad2140-IAD, cache-mel6520-MEL', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '1, 1', 'X-Timer': 'S1517183297.337465,VS0,VE1')>
, <ClientResponse(https://stackoverflow.com/questions/tagged/python) [200 OK]>
<CIMultiDictProxy('Content-Type': 'text/html; charset=utf-8', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'SAMEORIGIN', 'X-Request-Guid': '3fb98f74-2a89-497d-8d43-322f9a202775', 'Strict-Transport-Security': 'max-age=15552000', 'Content-Length': '23775', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '0', 'Connection': 'keep-alive', 'X-Served-By': 'cache-mel6520-MEL', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1517183297.107658,VS0,VE265', 'Vary': 'Accept-Encoding,Fastly-SSL', 'X-DNS-Prefetch-Control': 'off', 'Set-Cookie': 'prov=8edb36d8-8c63-bdd5-8d56-19bf14916c93; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly', 'Cache-Control': 'private')>
, <ClientResponse(https://aiohttp.readthedocs.io/en/stable/) [200 OK]>
<CIMultiDictProxy('Server': 'nginx/1.10.3 (Ubuntu)', 'Date': 'Sun, 28 Jan 2018 23:48:18 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Wed, 17 Jan 2018 08:45:22 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'ETag': 'W/"5a5f0d22-578a"', 'X-Subdomain-TryFiles': 'True', 'X-Served': 'Nginx', 'X-Deity': 'web01', 'Content-Encoding': 'gzip')>
]
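Since the question also needs ample sleep time between requests to the same site, the same pattern extends naturally with asyncio.sleep, which suspends one downloader without blocking the others. This is a network-free sketch of that idea, not the answer's code: the site names, counts, and delay are hypothetical placeholders, and a real version would `await session.get(url)` where the log line is:

```python
import asyncio

async def polite_fetch(name, n_requests, delay, log):
    """Hypothetical downloader: one coroutine per site, sleeping between requests.

    asyncio.sleep only suspends this coroutine; the other sites keep working.
    """
    for i in range(n_requests):
        # a real version would 'await session.get(url)' here
        log.append('{} request {}'.format(name, i + 1))
        await asyncio.sleep(delay)

async def run_all(log):
    # Each site gets its own coroutine; gather runs them concurrently
    await asyncio.gather(
        polite_fetch('SiteA', 2, 0.01, log),
        polite_fetch('SiteB', 2, 0.01, log),
    )

log = []
# Python 3.7+; on 3.6 use loop.run_until_complete as in the answer above
asyncio.run(run_all(log))
```

Each site keeps its own fixed delay between its own requests, while the total wall-clock time is roughly that of the slowest site rather than the sum of all of them.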

Answer 1 (score: 1)

Here's a minimal example demonstrating how to use concurrent.futures for parallel processing. It doesn't include the actual scraping logic, since you can add that yourself as needed, but it demonstrates the pattern to follow:

from concurrent import futures
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def scrape_func(*args, **kwargs):
    """ Stub function to use with futures - your scraping logic """
    print("Do something in parallel")
    return "result scraped"

def main():
    start_date = 'YYYY-MM-DD'
    end_date = 'YYYY-MM-DD'
    idx = pd.date_range(start_date,end_date)
    date_range = [d.strftime('%Y-%m-%d') for d in idx]
    max_retries_min_sleeptime = 300
    max_retries_max_sleeptime = 600
    min_sleeptime = 150
    max_sleeptime = 250

    # The important part - concurrent futures 
    # - set number of workers as the number of jobs to process

    with ThreadPoolExecutor(len(date_range)) as executor:
        # Use list jobs for concurrent futures
        # Use list scraped_results for results
        jobs = []
        scraped_results = []

        for date in date_range:
            # Pass some keyword arguments if needed - per job    
            kw = {"some_param": "value"}

            # Here we iterate 'number of dates' times, could be different
            # We're adding scrape_func, could be different function per call
            jobs.append(executor.submit(scrape_func, **kw))

        # Once parallel processing is complete, iterate over results
        for job in futures.as_completed(jobs):
            # Read result from future
            scraped_result = job.result()
            # Append to the list of results
            scraped_results.append(scraped_result)

        # Iterate over the scraped results and do whatever is needed
        for result in scraped_results:
            print("Do something with me {}".format(result))


if __name__=="__main__":
    main()

As mentioned, this is only meant to show the pattern to follow; the rest should be straightforward.
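Applied to the question's structure, the same pattern could run the five per-site functions concurrently within each date instead of one job per date. The `scrape_site` function and site names below are hypothetical stand-ins for Function1 through Function5:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_site(site_name):
    """Hypothetical stand-in for one of the question's five site functions."""
    return '{}: done'.format(site_name)

sites = ['Site1', 'Site2', 'Site3', 'Site4', 'Site5']
results = {}

# One worker per site: all five scrapers run at the same time for a date
with ThreadPoolExecutor(max_workers=len(sites)) as executor:
    future_to_site = {executor.submit(scrape_site, s): s for s in sites}
    for future in as_completed(future_to_site):
        results[future_to_site[future]] = future.result()
```

Because the `with` block only exits once every future has completed, `writer.save()` placed after it would see all five sheets written.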

Answer 2 (score: 0)

Thanks for the replies, guys! As it turns out, a very simple block of code from another question (Make 2 functions run at the same time) seems to do what I want.

from threading import Thread

def func1():
    print('Working')

def func2():
    print('Working')

if __name__ == '__main__':
    Thread(target = func1).start()
    Thread(target = func2).start()
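One caveat: Thread(...).start() returns immediately, so in the question's per-date loop, writer.save() would run before the scrapers finish. A minimal sketch that waits for both threads before continuing (the function bodies are placeholders for the real scraping functions):

```python
from threading import Thread

results = []

def func1():
    results.append('func1 done')

def func2():
    results.append('func2 done')

threads = [Thread(target=func1), Thread(target=func2)]
for t in threads:
    t.start()
for t in threads:
    # join() blocks until the thread finishes, so code after this point
    # (e.g. writer.save()) only runs once all scraping is complete
    t.join()
```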