I am writing a script that automatically scrapes historical data from several websites and saves it to the same Excel file for each past date within a given date range. Each individual function accesses several webpages from a different website, formats the data, and writes it to the file on a separate sheet. Because I am continuously making requests to these sites, I make sure to add ample sleep time between requests. Instead of running these functions one after another, is there a way to run them together?

I would like to make one request with Function 1, then one request with Function 2, and so on until every function has made one request. Once all of the functions have made a request, I would like it to loop back and complete the second request within each function (and so on) until all requests for a given date are complete. Doing this would keep the same sleep time between requests on any one website while cutting down substantially on total runtime. One thing to note is that each function makes a slightly different number of HTTP requests. For instance, Function 1 might make 10 requests on a given date while Function 2 makes 8, Function 3 makes 8, Function 4 makes 7, and Function 5 makes 10.

I have read up on this topic, including multithreading, but I am not sure how to apply it to my specific scenario. If there is no way to do this, I could run each function as its own script and run them all at the same time, but then I would have to join five different Excel files for each date, which is why I am trying to do it this way.
import random
import time

import pandas as pd

start_date = 'YYYY-MM-DD'
end_date = 'YYYY-MM-DD'
idx = pd.date_range(start_date, end_date)
date_range = [d.strftime('%Y-%m-%d') for d in idx]

max_retries_min_sleeptime = 300
max_retries_max_sleeptime = 600
min_sleeptime = 150
max_sleeptime = 250

for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    # Each function scrapes one website and writes its data to a separate sheet
    Function1()
    Function2()
    Function3()
    Function4()
    Function5()
    writer.save()
    print('Date Complete: ' + date)
    time.sleep(random.randrange(min_sleeptime, max_sleeptime, 1))
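To illustrate, here is a rough sketch of the scheduling I have in mind, assuming each FunctionN could be rewritten as a generator that yields once per HTTP request (this is only a sketch of the idea, not my working code):

import random
import time

def round_robin(site_generators, min_sleep, max_sleep):
    # Each generator yields once per HTTP request for its website.
    active = list(site_generators)
    while active:
        for gen in list(active):
            try:
                next(gen)            # one request for this site
            except StopIteration:
                active.remove(gen)   # this site has no more requests for the date
        # One pass made a request on every remaining site; sleep before the next pass
        time.sleep(random.randrange(min_sleep, max_sleep))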
Answer 0 (score: 1)
Using Python 3.6, here is a minimal example of making concurrent requests with aiohttp (docs). The example runs 3 downloader coroutines concurrently and appends each rsp to responses. I am sure you will be able to adapt the idea.
import asyncio
from aiohttp.client import ClientSession


async def downloader(session, iter_url, responses):
    # Pull URLs from the shared iterator until it is exhausted
    while True:
        try:
            url = next(iter_url)
        except StopIteration:
            return
        rsp = await session.get(url)
        if not rsp.status == 200:
            continue  # <- Or raise error
        responses.append(rsp)


async def run(urls, responses):
    async with ClientSession() as session:
        iter_url = iter(urls)
        # Three downloaders share one iterator, so each URL is fetched exactly once
        await asyncio.gather(*[downloader(session, iter_url, responses) for _ in range(3)])


urls = [
    'https://stackoverflow.com/questions/tagged/python',
    'https://aiohttp.readthedocs.io/en/stable/',
    'https://docs.python.org/3/library/asyncio.html'
]

responses = []
loop = asyncio.get_event_loop()
loop.run_until_complete(run(urls, responses))
Result:
>>> responses
[<ClientResponse(https://docs.python.org/3/library/asyncio.html) [200 OK]>
<CIMultiDictProxy('Server': 'nginx', 'Content-Type': 'text/html', 'Last-Modified': 'Sun, 28 Jan 2018 05:08:54 GMT', 'ETag': '"5a6d5ae6-6eae"', 'X-Clacks-Overhead': 'GNU Terry Pratchett', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains; preload', 'Via': '1.1 varnish', 'Fastly-Debug-Digest': '79eb68156ce083411371cd4dbd0cb190201edfeb12e5d1a8a1e273cc2c8d0e41', 'Content-Length': '28334', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '66775', 'Connection': 'keep-alive', 'X-Served-By': 'cache-iad2140-IAD, cache-mel6520-MEL', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '1, 1', 'X-Timer': 'S1517183297.337465,VS0,VE1')>
, <ClientResponse(https://stackoverflow.com/questions/tagged/python) [200 OK]>
<CIMultiDictProxy('Content-Type': 'text/html; charset=utf-8', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'SAMEORIGIN', 'X-Request-Guid': '3fb98f74-2a89-497d-8d43-322f9a202775', 'Strict-Transport-Security': 'max-age=15552000', 'Content-Length': '23775', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '0', 'Connection': 'keep-alive', 'X-Served-By': 'cache-mel6520-MEL', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1517183297.107658,VS0,VE265', 'Vary': 'Accept-Encoding,Fastly-SSL', 'X-DNS-Prefetch-Control': 'off', 'Set-Cookie': 'prov=8edb36d8-8c63-bdd5-8d56-19bf14916c93; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly', 'Cache-Control': 'private')>
, <ClientResponse(https://aiohttp.readthedocs.io/en/stable/) [200 OK]>
<CIMultiDictProxy('Server': 'nginx/1.10.3 (Ubuntu)', 'Date': 'Sun, 28 Jan 2018 23:48:18 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Wed, 17 Jan 2018 08:45:22 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'ETag': 'W/"5a5f0d22-578a"', 'X-Subdomain-TryFiles': 'True', 'X-Served': 'Nginx', 'X-Deity': 'web01', 'Content-Encoding': 'gzip')>
]
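If the per-site delay from the question needs to be preserved, each downloader could sleep between its own requests. A minimal sketch of that variation follows; the sleep bounds are taken from the question and are only an assumption here:

import asyncio
import random

min_sleeptime = 150  # assumed: same bounds as in the question
max_sleeptime = 250

async def polite_downloader(session, iter_url, responses):
    while True:
        try:
            url = next(iter_url)
        except StopIteration:
            return
        rsp = await session.get(url)
        if rsp.status == 200:
            responses.append(rsp)
        # asyncio.sleep suspends only this coroutine; the other downloaders keep running
        await asyncio.sleep(random.randrange(min_sleeptime, max_sleeptime))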
Answer 1 (score: 1)
Here is a minimal example demonstrating how to do parallel processing with concurrent.futures. It does not include the actual scraping logic, since you can add that yourself as needed, but it shows the pattern to follow:
from concurrent import futures
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def scrape_func(*args, **kwargs):
    """ Stub function to use with futures - your scraping logic """
    print("Do something in parallel")
    return "result scraped"


def main():
    start_date = 'YYYY-MM-DD'
    end_date = 'YYYY-MM-DD'
    idx = pd.date_range(start_date, end_date)
    date_range = [d.strftime('%Y-%m-%d') for d in idx]
    max_retries_min_sleeptime = 300
    max_retries_max_sleeptime = 600
    min_sleeptime = 150
    max_sleeptime = 250

    # The important part - concurrent futures
    # - set number of workers as the number of jobs to process
    with ThreadPoolExecutor(len(date_range)) as executor:
        # Use list jobs for concurrent futures
        # Use list scraped_results for results
        jobs = []
        scraped_results = []
        for date in date_range:
            # Pass some keyword arguments if needed - per job
            kw = {"some_param": "value"}
            # Here we iterate 'number of dates' times, could be different
            # We're adding scrape_func, could be different function per call
            jobs.append(executor.submit(scrape_func, **kw))

        # Once parallel processing is complete, iterate over results
        for job in futures.as_completed(jobs):
            # Read result from future
            scraped_result = job.result()
            # Append to the list of results
            scraped_results.append(scraped_result)

    # Iterate over results scraped and do whatever is needed
    for result in scraped_results:
        print("Do something with me {}".format(result))


if __name__ == "__main__":
    main()
As mentioned above, this is only meant to show the pattern to follow; the rest should be straightforward.
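As one possible adaptation to the original per-date loop (a sketch under assumptions, not the answer's code): use one worker per website so the five functions run side by side for a given date, have each worker return its data instead of writing to the workbook, and keep all Excel I/O in the main thread. The scrape_siteN names below are hypothetical stand-ins for the question's Function1 ... Function5.

from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd


def run_one_date(date):
    # Assumed: each scrape_siteN(date) does its own sleeping between requests and
    # returns a (sheet_name, DataFrame) pair rather than touching the workbook itself.
    site_funcs = [scrape_site1, scrape_site2, scrape_site3, scrape_site4, scrape_site5]
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    with ThreadPoolExecutor(max_workers=len(site_funcs)) as executor:
        jobs = [executor.submit(func, date) for func in site_funcs]
        for job in as_completed(jobs):
            sheet_name, frame = job.result()        # re-raises any worker exception
            frame.to_excel(writer, sheet_name=sheet_name)
    writer.save()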
Answer 2 (score: 0)
Thanks for the replies, guys! As it turns out, a very simple block of code from another question (Make 2 functions run at the same time) seems to do what I want.
from threading import Thread


def func1():
    print('Working')


def func2():
    print('Working')


if __name__ == '__main__':
    Thread(target=func1).start()
    Thread(target=func2).start()
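One thing to note when slotting this into the per-date loop above: writer.save() would run before the site functions finish unless the threads are joined first. A sketch of that adjustment (Function1 ... Function5, writer, and date_range are the names from my original code):

from threading import Thread

for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    threads = [Thread(target=func) for func in
               (Function1, Function2, Function3, Function4, Function5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # wait until every site has finished before saving the workbook
    writer.save()
    print('Date Complete: ' + date)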