How to parallelize a for loop without duplicating IDs

Time: 2018-03-27 19:34:57

Tags: python python-3.x parallel-processing multiprocessing

I'm new to Python. I'm crawling a website and scraping values from it, but since there are 100k pages to index it takes a long time, and I'd like to know how to speed it up. I've read that multithreading can run into conflicts or not work, and that multiprocessing is the best way to start.

Here is a sample of my code:

import requests

def main():
    for ID in range(1, 100000):
        response = requests.get("http://example.com/?id=" + str(ID))
        # do stuff with / print HTML elements from response

If I do this:

import multiprocessing

if __name__ == '__main__':
    for i in range(50):
        p = multiprocessing.Process(target=main)
        p.start()

it does run the function in parallel, but every process loops over the full range, and I only want each process to scrape IDs that haven't already been scraped by another process. If I call p.join(), it doesn't seem to be any faster than without multiprocessing, so I don't know what to do.
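One way to get what the question asks for with plain multiprocessing.Process is to give each process its own disjoint slice of the ID range, so no two processes ever touch the same ID. A minimal sketch along those lines (example.com and the per-page handling are placeholders from the question; the chunking arithmetic assumes exactly 50 workers):

import multiprocessing
import requests

def scrape_range(start, stop):
    # Each process handles its own disjoint slice of IDs, so no ID is scraped twice.
    for page_id in range(start, stop):
        response = requests.get("http://example.com/?id=" + str(page_id))
        # do stuff with / print HTML elements from response

if __name__ == '__main__':
    processes = []
    chunk = 100000 // 50  # 2000 IDs per process
    for i in range(50):
        start = 1 + i * chunk
        p = multiprocessing.Process(target=scrape_range, args=(start, start + chunk))
        p.start()
        processes.append(p)
    # join() only waits for the workers to finish; the speedup comes from them
    # running concurrently, not from join() itself.
    for p in processes:
        p.join()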

1 Answer:

Answer 0 (score: 2)

Here is an example based on the concurrent.futures module:

import concurrent.futures
import requests

# Retrieve a single page and return its contents.
def load_url(page_id, timeout):
    response = requests.get("http://example.com/?id=" + str(page_id), timeout=timeout)
    return response.content  # do stuff with the HTML elements of the page here

# We can use a with statement to ensure threads are cleaned up promptly.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and map each future back to its page ID.
    future_to_id = {executor.submit(load_url, page_id, 60): page_id
                    for page_id in range(1, 100000)}
    for future in concurrent.futures.as_completed(future_to_id):
        page_id = future_to_id[future]
        try:
            data = future.result()
        except Exception as exc:
            print('page %d generated an exception: %s' % (page_id, exc))
        else:
            print('page %d is %d bytes' % (page_id, len(data)))
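Since the work here is mostly waiting on network I/O, threads are usually sufficient despite the GIL. If the per-page processing ever becomes CPU-bound, the same interface is available on processes via ProcessPoolExecutor; a minimal sketch, assuming load_url is defined at module level as above:

import concurrent.futures
import itertools

if __name__ == '__main__':
    # ProcessPoolExecutor has the same interface as ThreadPoolExecutor, but
    # distributes calls across worker processes instead of threads.
    with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
        # map() yields results in input order; itertools.repeat supplies the
        # timeout argument for every call.
        results = executor.map(load_url, range(1, 100000), itertools.repeat(60))
        for page_id, data in zip(range(1, 100000), results):
            print('page %d is %d bytes' % (page_id, len(data)))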