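For my bachelor's thesis I need to fetch some data from about 40,000 websites. I am using Python requests for this, but at the moment getting responses from the servers is very slow.
Is there any way to speed this up while keeping my current header settings? None of the tutorials I found use headers.
Here is my code snippet: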
def parse(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/39.0.2171.95 Safari/537.36'}
    r = requests.get(url, headers=headers)
    for line in r.iter_lines():
        ...
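Answer 0 (score: 1)
You have to use threads, because this is an I/O-bound problem. The built-in threading library is your best option. I use a Semaphore object to keep track of, and cap, how many threads are running at the same time.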
import time
import threading

# Maximum number of threads allowed to run at the same time
lock = threading.Semaphore(2)


def parse(url):
    """
    Change to your logic, I just use sleep to mock http request.
    """
    try:
        print('getting info', url)
        time.sleep(2)
    finally:
        # After we are done, free the slot so another thread can start
        lock.release()


def parse_pool():
    # List of all your urls
    list_of_urls = ['website1', 'website2', 'website3', 'website4']
    # List of thread objects, so we can join them later
    thread_pool = []
    for url in list_of_urls:
        # Take a slot before starting; this blocks while 2 threads are already running
        lock.acquire()
        # Create a new thread that calls your function with a url
        thread = threading.Thread(target=parse, args=(url,))
        thread_pool.append(thread)
        thread.start()
    for thread in thread_pool:
        thread.join()
    print('done')
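As a complement to the semaphore approach, here is a minimal sketch that uses concurrent.futures.ThreadPoolExecutor from the standard library, which caps the number of worker threads for you. The names HEADERS, fetch and fetch_all are illustrative and not from the original post. fetch uses urllib.request so that the sketch has no third-party dependency; its body could presumably be swapped for requests.get(url, headers=HEADERS) to keep the header setup from the question.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import Request, urlopen

# Assumption: a shortened stand-in for the User-Agent header in the question.
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def fetch(url, timeout=10):
    """Download one page with custom headers (hypothetical helper)."""
    request = Request(url, headers=HEADERS)
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=20):
    """Run fetch_one over urls in a thread pool.

    Returns a dict mapping each url to its result, or to the exception
    raised for that url, so one failing site does not stop the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(fetch_one, url): url for url in urls}
        # Handle each future as soon as it finishes, in completion order
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Calling fetch_all(list_of_urls) returns once every URL has either a response body or an exception recorded. The max_workers=20 default is an arbitrary starting point, not a recommendation from the answer; it is worth tuning against what the target servers tolerate.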
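Answer 1 (score: 0)
You can use asyncio to run the tasks concurrently. asyncio.wait() returns two sets of tasks (completed and pending), which you can use to collect the URL responses while the coroutines run asynchronously. The results come back in an arbitrary order, but this approach is faster.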
import asyncio


async def parse(url):
    print('in parse for url {}'.format(url))
    # Placeholder: replace with your own awaitable that fetches the info;
    # asyncio.sleep stands in for waiting on the response from the url
    info = await asyncio.sleep(1, result='info')
    print('done with url {}'.format(url))
    return 'parse {} result from {}'.format(info, url)


async def main(sites):
    print('starting main')
    # asyncio.wait() needs tasks, not bare coroutines (enforced since Python 3.11)
    parses = [
        asyncio.ensure_future(parse(url))
        for url in sites
    ]
    print('waiting for phases to complete')
    completed, pending = await asyncio.wait(parses)
    results = [t.result() for t in completed]
    print('results: {!r}'.format(results))


websites = ['site1', 'site2', 'site3']
# asyncio.run() creates the event loop, runs main() and closes the loop
asyncio.run(main(websites))
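If the order of the results matters, or the fetching code is blocking (as requests is), one possible variation is sketched below. asyncio.gather returns results in the order of its inputs, loop.run_in_executor moves a blocking call onto a worker thread, and an asyncio.Semaphore caps how many fetches are in flight. The names blocking_fetch, fetch_all_async and limit are hypothetical, and blocking_fetch only sleeps instead of making a real HTTP request.

```python
import asyncio
import time


def blocking_fetch(url):
    """Stand-in for a blocking call such as requests.get (assumption)."""
    time.sleep(0.1)
    return 'body of {}'.format(url)


async def fetch_all_async(urls, fetch=blocking_fetch, limit=10):
    """Fetch all urls concurrently, with at most `limit` in flight."""
    loop = asyncio.get_running_loop()
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(url):
        async with semaphore:
            # Run the blocking function in the loop's default thread pool
            return await loop.run_in_executor(None, fetch, url)

    # gather() returns results in the same order as the input urls
    return await asyncio.gather(*(fetch_one(url) for url in urls))


if __name__ == '__main__':
    pages = asyncio.run(fetch_all_async(['site1', 'site2', 'site3']))
    print(pages)
```

Note that the default executor has its own worker limit, so a very large limit value does not by itself guarantee that many simultaneous requests.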
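Answer 2 (score: -1)
I think using multiple threads or processes, for example with threading or multiprocess, is a good idea. Alternatively, you can use grequests (asynchronous requests), which is built on gevent.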