This thread explains what CPU-bound and I/O-bound problems are. Given that Python has the GIL, someone recommended:
• Use threads for I/O-bound problems
• Use processes, networking, or events (discussed in the next section) for CPU-bound problems
Honestly, I cannot fully and intuitively understand what these problems really are.
Here is the situation I'm faced with:
def crawl_and_save_them_in_db(item):
    # Pseudo code
    # crawled_items = crawl item from the web  (the number of crawled_items is usually 200)
    # while crawled_items:
    #     save them in the PostgreSQL DB
    #     crawled_items = crawl item from the web

crawl_item_list = [item1, item2, item3, ...]
for item in crawl_item_list:
    crawl_and_save_them_in_db(item)
This is the task that I want to perform with parallel processes or threads. (Each process or thread would run its own crawl_and_save_them_in_db and handle one item.)
In this case, which one should I choose: multiprocessing (something like Pool) or multithreading?
Since the main job of this task is storing data in the DB, which is, I hope, an I/O-bound task, I think I should use multiple threads. Am I right?
I need your advice.
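For context, here is a minimal sketch of what the thread-based version of the loop above could look like, using the standard-library concurrent.futures module. The crawl-and-save body is stubbed out (the f-string return value is just a placeholder for the real fetch-and-insert work):

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_and_save_them_in_db(item):
    # Stub: in the real code this would crawl ~200 records for the
    # item and save them to PostgreSQL. A string stands in for the
    # side effects so the sketch is self-contained.
    return f"saved:{item}"

crawl_item_list = ["item1", "item2", "item3"]

# While the work is dominated by waiting on the network and the
# database, threads can overlap that waiting: the GIL is released
# during blocking I/O calls.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(crawl_and_save_them_in_db, crawl_item_list))

print(results)
```

pool.map preserves the input order, so the results line up with crawl_item_list even though the items are processed concurrently.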
Answer 0 (score: 2)
It depends on the amount of data that will be stored.
If there are millions of records, then I strongly recommend the multiprocessing approach. It can be Python's built-in multiprocessing or a third-party package.
If the data is more lightweight, then use threads, or you could even try gevent.
In my crawling project I started with threads, then moved to gevent because it was easier to maintain. After my data grew to millions of records, the part responsible for storing the bulk of the data was moved into a separate multiprocessing module, apart from the internal threads. It was somewhat inconvenient to maintain and improve, but a job that used to take hours now takes 5-10 minutes.