I'm new to Python. I'm indexing my way through a website and scraping values from it, but since there are 100k pages to index it takes a lot of time. I'd like to know how to speed it up. I've read that multithreading can run into conflicts / not work, and that multiprocessing is the best way to start.
Here is a sample of my code:
import requests

def main():
    for ID in range(1, 100000):
        requests.get("example.com/?id=" + str(ID))
        # do stuff / print html elements off of url.
If I do this:
import multiprocessing

if __name__ == '__main__':
    for i in range(50):
        p = multiprocessing.Process(target=main)
        p.start()
It does run the function in parallel, but I only want each process to scrape an ID that hasn't already been scraped by another process. If I call p.join() there doesn't seem to be any speedup over running without multiprocessing, so I don't know what to do.
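One common way to guarantee that each ID is handled exactly once is to let multiprocessing.Pool divide the range among the worker processes, instead of starting 50 copies of the same loop. A minimal sketch, using a stand-in `fetch` function (the real version would call requests.get and parse the page, which is omitted here):

```python
import multiprocessing

def fetch(page_id):
    # Stand-in for the real request/parse step; here we just
    # return a derived value so the example is self-contained.
    return page_id * 2

if __name__ == '__main__':
    # Pool.map splits the ID range across the worker processes,
    # so no two workers ever receive the same ID.
    with multiprocessing.Pool(processes=8) as pool:
        results = pool.map(fetch, range(1, 100))
    print(len(results))
```

The key difference from spawning 50 identical Process objects is that the Pool hands each worker a disjoint chunk of the input, so there is no duplicated work to coordinate.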
Answer 0 (score: 2)
Here is an example based on the concurrent.futures module:

import concurrent.futures
import requests

# Retrieve a single page and process its contents
def load_url(page_id, timeout):
    response = requests.get("example.com/?id=" + str(page_id), timeout=timeout)
    return do_stuff(response)  # do stuff on html elements off of url.

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its page ID
    future_to_id = {executor.submit(load_url, page_id, 60): page_id
                    for page_id in range(1, 100000)}
    for future in concurrent.futures.as_completed(future_to_id):
        page_id = future_to_id[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (page_id, exc))
        else:
            print('%r page is %d bytes' % (page_id, len(data)))
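If the per-page processing turns out to be CPU-bound rather than I/O-bound, the same submit/as_completed pattern works with ProcessPoolExecutor. A sketch with a stand-in work function, since do_stuff above is a placeholder and not defined here:

```python
import concurrent.futures

def work(page_id):
    # Stand-in for load_url; the real code would fetch and parse
    # the page instead of computing a value.
    return page_id * page_id

if __name__ == '__main__':
    # Swapping ThreadPoolExecutor for ProcessPoolExecutor keeps the
    # same pattern but runs the work in separate processes.
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(work, i): i for i in range(1, 10)}
        results = {futures[f]: f.result()
                   for f in concurrent.futures.as_completed(futures)}
    print(results[3])
```

For network-heavy scraping, threads are usually sufficient because the workers spend most of their time waiting on the server; processes mainly pay off when parsing dominates.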