Question

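I am writing a script that:

fetches a list of URLs from the database (about 10,000 URLs)
downloads all the pages and inserts them into the db
parses the code
if (some condition), does further inserts into the db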

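I have a quad-core Xeon with hyper-threading, so 8 hardware threads are available in total, and I am running Linux (64-bit).

I am using cStringIO as a buffer, pycurl to fetch the pages, BeautifulSoup to parse them, and MySQLdb to interact with the database.

I have tried to simplify the code below (removing all the try/except blocks, the parsing operations, ...).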

import cStringIO, threading, MySQLdb.cursors, pycurl

NUM_THREADS = 100
lock_list = threading.Lock()
lock_query = threading.Lock()


db = MySQLdb.connect(host = "...", user = "...", passwd = "...", db = "...", cursorclass=MySQLdb.cursors.DictCursor)
cur = db.cursor()
cur.execute("SELECT...")
rows = cur.fetchall()
rows = [x for x in rows]  # convert to a list so it's editable


class MyThread(threading.Thread):
    def run(self):
        """ initialize a StringIO object and a pycurl object """

        while True:
            lock_list.acquire()  # acquire the lock to extract a url
            if not rows:  # list is empty, no more url to process
                lock_list.release()
                break
            row = rows.pop()
            lock_list.release()

            """ download the page with pycurl and do some check """

            """ WARNING: possible bottleneck if all the pycurl
                connections are waiting for the timeout """

            lock_query.acquire()
            cur.execute("INSERT INTO ...")  # insert the full page into the database
            db.commit()
            lock_query.release()

            """do some parse with BeautifulSoup using the StringIO object"""

            if something is not None:
                lock_query.acquire()
                cur.execute("INSERT INTO ...")  # insert the result of parsing into the database
                db.commit()
                lock_query.release()


# create and start all the threads
threads = []
for i in range(NUM_THREADS):
    t = MyThread()
    t.start()
    threads.append(t)

# wait for threads to finish
for t in threads:
    t.join()

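I use multithreading so that I don't have to wait when some requests fail with a timeout. That particular thread will wait, but the others are free to continue with the remaining URLs.

Here is a screenshot taken with nothing running but the script. It seems that 5 cores are busy while the others are not. So the questions are: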

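Should I create as many cursors as there are threads?
Do I really need to lock the execution of a query? What happens if one thread has run cur.execute() but not yet db.commit(), and another thread comes in and does an execute + commit of a different query?
I read about the Queue class, but I am not sure I understood it correctly: can I use it instead of lock + extract url + release?
Using multithreading, can I run into an I/O (network) bottleneck? With 100 threads I don't exceed ~500Kb/s, while my connection can go faster. If I move to multiprocessing, would I see some improvement on this side?
The same question, but for MySQL: with my code, could there be a bottleneck on this side? Could all those lock + insert query + release sequences be improved in some way?
If multithreading is the way to go, is 100 a high number of threads? I mean, are too many threads doing I/O requests (or DB queries) useless because these operations are mutually exclusive? Or do more threads mean more network throughput?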
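
To make the Queue question concrete, this is roughly how I understand the replacement for lock + extract url + release would look. It is only a minimal sketch of my understanding, not the real script: fetch() is a hypothetical stand-in for the pycurl download, run() and NUM_THREADS = 8 are names and values I made up for illustration, and the database part is left out entirely. The import is written so that it works under both the Python 2 module name (Queue) and the Python 3 one (queue).

```python
import threading

try:
    import queue  # Python 3 module name
except ImportError:
    import Queue as queue  # Python 2 module name

NUM_THREADS = 8  # hypothetical value, not a recommendation


def fetch(url):
    """Hypothetical stand-in for the pycurl download step."""
    return "<html>%s</html>" % url


def worker(url_queue, result_queue):
    while True:
        try:
            # Queue does its own locking, so no explicit lock is needed here.
            url = url_queue.get_nowait()
        except queue.Empty:
            break  # the queue was filled before start, so empty means done
        result_queue.put((url, fetch(url)))


def run(urls):
    url_queue = queue.Queue()
    result_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)

    threads = [
        threading.Thread(target=worker, args=(url_queue, result_queue))
        for _ in range(NUM_THREADS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # All workers have finished, so draining without blocking is safe.
    results = []
    while True:
        try:
            results.append(result_queue.get_nowait())
        except queue.Empty:
            break
    return results
```

Calling run(["http://example.com/a", "http://example.com/b"]) would return one (url, page) pair per URL, in no particular order. Whether this actually helps with the throughput or the MySQL locking is exactly what I am unsure about.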

Improving multithreading with network IO and database queries
