Python: What should I use multi-process or multi-thread in DB related tasks?

Date: 2018-02-03 08:58:27

Tags: python multithreading multiprocessing python-multiprocessing

This thread explains what CPU-bound and I/O-bound problems are.

Given that Python has the GIL, someone recommended:

• Use threads for I/O bound problems
• Use processes, networking, or events (discussed in the next section) for CPU-bound problems

Honestly, I cannot fully and intuitively understand what these problems really are.

Here is the situation I'm faced with:

def crawl_and_save_them_in_db(item):
    # Pseudocode:
    # crawled_items = crawl one batch for `item` from the web  (a batch is usually ~200 items)
    # while crawled_items:
    #     save crawled_items in the PostgreSQL DB
    #     crawled_items = crawl the next batch for `item` from the web
    pass

crawl_item_list = [item1, item2, item3, ...]
for item in crawl_item_list:
    crawl_and_save_them_in_db(item)

This is the task that I want to perform with parallel processes or threads. (Each process or thread would run its own crawl_and_save_them_in_db call and handle one item.)

In this case, which should I choose: multiple processes (something like Pool) or multiple threads?

I think that since the main job of this task is storing data in the DB, which is (I hope) an I/O-bound task, I should use multiple threads. Am I right?
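For reference, the thread-based version of the loop can be sketched like this, assuming crawl_and_save_them_in_db is thread-safe and each thread uses its own DB connection; the stub below stands in for the real function:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_and_save_them_in_db(item):
    # Stub for the real function; the real body would block on network
    # and PostgreSQL I/O, during which the GIL is released
    return f"saved {item}"

crawl_item_list = ["item1", "item2", "item3"]

# Threads overlap while each one waits on I/O for its own item
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(crawl_and_save_them_in_db, crawl_item_list))
```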

I need your advice.

1 Answer:

Answer 0 (Score: 2):

It depends on how much data will be stored.

If there are millions of records, then I strongly recommend a multiprocessing approach. It could be Python's built-in multiprocessing or a third-party package.

If the data is more lightweight, then use threads, or even try gevent.

In my crawling project I started with threads, then moved to gevent because it was easier to maintain. After my data grew to millions of records, the part responsible for storing large amounts of data was moved out of the internal threads into a separate multiprocessing module. It is a bit awkward to maintain and improve, but a job that used to take hours now finishes in 5-10 minutes.