I am trying to write a module that needs to fetch a set of URLs concurrently/in parallel. Since this is an expensive network-IO operation rather than a CPU-heavy one, I am using ThreadPoolExecutor.
Now, in my code, multiple functions add tasks to a shared thread pool.
My problem is that the main thread halts before all of the Future objects have finished being processed in the callback functions.
I am a beginner with futures and ThreadPoolExecutor. Any help would be appreciated.
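To make the pattern concrete before the full class below, here is a stripped-down sketch of what I mean; fetch_url, parse_page and on_fetch_done are placeholder names, not my real functions:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

pool = ThreadPoolExecutor(max_workers=4)

def fetch_url(url):
    # blocking network IO, which is why threads (not processes) are fine here
    return urllib.request.urlopen(url).read()

def parse_page(page):
    # placeholder for the real post-processing step
    return len(page)

def on_fetch_done(future):
    # done-callback: runs in a worker thread once the fetch future finishes
    # and submits a follow-up task to the same shared pool
    page = future.result()
    pool.submit(parse_page, page).add_done_callback(lambda f: print(f.result()))

for url in ['http://stackoverflow.com/',
            'https://docs.python.org/3.3/library/concurrent.futures.html']:
    pool.submit(fetch_url, url).add_done_callback(on_fetch_done)

# the main thread falls through to here right away and shuts the pool down;
# any work submitted from a callback after this shutdown call is rejected by the pool
pool.shutdown(wait=True)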
import settings
from concurrent.futures import ThreadPoolExecutor
import concurrent.futures

class Test(Base):
    WORKER_THREADS = settings.WORKER_THREADS

    def __init__(self, urls):
        super(Test, self).__init__()
        self.urls = urls
        self.worker_pool = ThreadPoolExecutor(max_workers=Test.WORKER_THREADS)

    def add_to_worker_queue(self, task, callback, **kwargs):
        self.logger.info("Adding task %s to worker pool.", task.func_name)
        self.worker_pool.submit(task, **kwargs).add_done_callback(callback)
        return

    def load_url(self, url):
        response = self.make_requests(urls=url)  # make_requests is in the Base class (it just makes an HTTP req)
        # response is a generator, so to get the data out of it we need to iterate through it
        for res in response:
            return res

    def handle_response(self, response):
        # do some stuff with the response and add it to the worker queue again for further parallel processing
        self.add_to_worker_queue(some_task, callback_func, data=response)
        return

    def start(self):
        for url in self.urls:
            self.add_to_worker_queue(self.load_url, self.handle_response, url=[url])
        return

    def stop(self):
        self.worker_pool.shutdown(wait=True)
        return


if __name__ == "__main__":
    start_urls = ['http://stackoverflow.com/',
                  'https://docs.python.org/3.3/library/concurrent.futures.html']
    test = Test(urls=start_urls)
    test.start()
    test.stop()
Following this example, I tried using the executor with a "with" statement: https://docs.python.org/3.3/library/concurrent.futures.html#threadpoolexecutor-example
But since I submit tasks to the pool one by one, the example above waits for the Future objects to complete, which defeats the purpose for me.
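For reference, this is roughly what I tried, adapted from the linked documentation example (the URLs and worker count are the docs' own); the with block implicitly calls shutdown(wait=True), and the loop over as_completed keeps the main thread blocked until every submitted future has finished:

import concurrent.futures
import urllib.request

URLS = ['http://stackoverflow.com/',
        'https://docs.python.org/3.3/library/concurrent.futures.html']

def load_url(url, timeout):
    return urllib.request.urlopen(url, timeout=timeout).read()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # submit everything up front, then wait for the results in the main thread
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))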