I am writing a script:
I have a quad-core Xeon with hyper-threading, so 8 threads are available, and I am on Linux (64-bit).
I use cStringIO as a buffer, pycurl to fetch the pages, BeautifulSoup to parse them, and MySQLdb to interact with the database.
I have simplified the code below (removing all the try/except blocks, the parsing operations, ...).
import cStringIO, threading, MySQLdb.cursors, pycurl

NUM_THREADS = 100

lock_list = threading.Lock()
lock_query = threading.Lock()

db = MySQLdb.connect(host="...", user="...", passwd="...", db="...",
                     cursorclass=MySQLdb.cursors.DictCursor)
cur = db.cursor()
cur.execute("SELECT...")
rows = cur.fetchall()
rows = [x for x in rows]  # convert to a list so it's editable

class MyThread(threading.Thread):
    def run(self):
        # initialize a StringIO object and a pycurl object

        while True:
            lock_list.acquire()  # acquire the lock to extract a url
            if not rows:         # list is empty, no more urls to process
                lock_list.release()
                break
            row = rows.pop()
            lock_list.release()

            # download the page with pycurl and do some checks
            # WARNING: possible bottleneck if all the pycurl
            # connections are waiting for the timeout

            lock_query.acquire()
            cur.execute("INSERT INTO ...")  # insert the full page into the database
            db.commit()
            lock_query.release()

            # do some parsing with BeautifulSoup using the StringIO object

            if something is not None:
                lock_query.acquire()
                cur.execute("INSERT INTO ...")  # insert the result of parsing into the database
                db.commit()
                lock_query.release()

# create and start all the threads
threads = []
for i in range(NUM_THREADS):
    t = MyThread()
    t.start()
    threads.append(t)

# wait for threads to finish
for t in threads:
    t.join()
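The lock-around-a-list dispatch above can also be expressed with the stdlib Queue, which does the locking internally. A minimal Python 3 sketch under stated assumptions: `fetch()` is a hypothetical stand-in for the pycurl download, and `results` plays the role of the database protected by `lock_query`:

```python
import queue
import threading

NUM_THREADS = 4

def fetch(url):
    # hypothetical stand-in for the pycurl download + BeautifulSoup parse
    return "page for %s" % url

def worker(url_queue, results, results_lock):
    while True:
        try:
            url = url_queue.get_nowait()  # thread-safe pop, no manual lock needed
        except queue.Empty:
            break                         # no more urls to process
        page = fetch(url)
        with results_lock:                # one lock around the shared sink,
            results.append(page)          # like lock_query around the cursor
        url_queue.task_done()

url_queue = queue.Queue()
for u in ["url1", "url2", "url3", "url4", "url5"]:
    url_queue.put(u)

results = []
results_lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(url_queue, results, results_lock))
           for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 5
```

The `with lock:` form also guarantees the lock is released even if the body raises, which the manual acquire/release pairs in my script do not.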
I use multithreading so I do not have to wait for some requests that fail because of a timeout: that particular thread will wait, but the others can keep going with the other urls.
Here is a screenshot taken while running nothing but the script. It looks like 5 cores are busy while another one is not. So the questions are:
- Will I hit an I/O (network) bottleneck with multithreading? With 100 threads my speed does not go above ~500Kb/s, while my connection can be faster. If I move to multiprocess, will I see some improvement on this front?
- Sticking with multithreading, what is a good number of threads: is 100 reasonable? I mean, is it useless to run too many threads doing I/O requests (or database queries) because of the mutual exclusion on these operations, or do more threads mean more network speed?
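To illustrate the trade-off behind that last question, here is a hedged Python 3 sketch (stdlib only, `time.sleep` standing in for both the network wait and the work done while holding the lock): simulated downloads overlap across threads, while whatever happens under a single shared lock runs one thread at a time, so past some point extra threads only queue up on the lock.

```python
import threading
import time

LOCK = threading.Lock()

def job(io_seconds, locked_seconds):
    time.sleep(io_seconds)       # simulated network wait: overlaps across threads
    with LOCK:                   # simulated db insert: serialized by the lock
        time.sleep(locked_seconds)

def run(n_threads, io_seconds, locked_seconds):
    threads = [threading.Thread(target=job, args=(io_seconds, locked_seconds))
               for _ in range(n_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

# 8 threads each "downloading" for 0.2s finish in about 0.2s of wall time,
# not 1.6s, because the waits overlap ...
overlapped = run(8, 0.2, 0.0)
# ... but 8 threads each holding the lock for 0.1s take about 0.8s,
# because the locked section admits one thread at a time.
serialized = run(8, 0.0, 0.1)
print(overlapped < 0.5, serialized > 0.7)
```

By the same logic, more threads in the script only help while they are blocked on the network; once most time is spent inside `lock_query`, adding threads cannot add speed.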