This thread explains what CPU-bound and I/O-bound problems are. Given that Python has the GIL, someone recommended:
• Use threads for I/O-bound problems
• Use processes, networking, or events (discussed in the next section) for CPU-bound problems
Honestly, I cannot fully and intuitively understand what these problems really are.
Here is the situation I'm faced with:
def crawl_and_save_them_in_db(item):
    # Pseudo code
    # crawled_items = crawl item from the web  (the number of crawled_items is usually 200)
    # while crawled_items:
    #     save them in the PostgreSQL DB
    #     crawled_items = crawl item from the web

crawl_item_list = [item1, item2, item3, ...]
for item in crawl_item_list:
    crawl_and_save_them_in_db(item)
This is the task that I want to perform with parallel processes or threads. (Each process or thread would run its own crawl_and_save_them_in_db and handle one item.)
In this case, which one should I choose: multiprocessing (something like Pool) or multithreading?
Since the main job of this task is storing data in the DB, which is, I hope, an I/O-bound task, I think I should use multiple threads. Am I right?
I need your advice.
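For context, here is a minimal sketch of what the thread-based version of the loop above could look like, using the standard-library concurrent.futures module. The crawl-and-save body is stubbed out (the f-string return value is just a placeholder for the real fetch-and-insert work):

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_and_save_them_in_db(item):
    # Stub: in the real code this would crawl ~200 records for the
    # item and save them to PostgreSQL. A string stands in for the
    # side effects so the sketch is self-contained.
    return f"saved:{item}"

crawl_item_list = ["item1", "item2", "item3"]

# While the work is dominated by waiting on the network and the
# database, threads can overlap that waiting: the GIL is released
# during blocking I/O calls.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(crawl_and_save_them_in_db, crawl_item_list))

print(results)
```

pool.map preserves the input order, so the results line up with crawl_item_list even though the items are processed concurrently.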
Answer 0 (score: 2)
It depends on the amount of data that will be stored.
If there are millions of records, then I strongly recommend the multiprocessing approach. It can be Python's built-in multiprocessing or a third-party package.
If the data is more lightweight, then use threads, or you could even try gevent.
In my crawling project I started with threads, then moved to gevent because it was easier to maintain. After my data grew to millions of records, the part responsible for storing the bulk of the data was moved into a separate multiprocessing module, apart from the internal threads. It was somewhat inconvenient to maintain and improve, but a job that used to take hours now takes 5-10 minutes.