How can I speed up web scraping with nested urllib2.urlopen() calls in Python?

Asked: 2015-07-28 05:04:59

Tags: python multithreading web-scraping

I have the following code to count the number of words in each chapter of a book. In short, it opens the URL for each book and then opens the URL of every chapter belonging to that book.

import urllib2
from bs4 import BeautifulSoup
import re

def scrapeBook(bookId):
    url = 'http://www.qidian.com/BookReader/'+str(bookId)+'.aspx'
    try:
        words = []
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)           
        try:                             
            chapters = soup.find_all('a', rel='nofollow')  # find all relevant chapters
            for chapter in chapters:                       # loop through chapters
                if 'title' in chapter.attrs: 
                    link = chapter['href']                 # go to chapter to find words
                    htmlTemp = urllib2.urlopen(link).read()
                    soupTemp = BeautifulSoup(htmlTemp)

                    # find out how many words there are in each chapter
                    spans = soupTemp.find_all('span')
                    for span in spans:
                        content = span.string
                        if content is not None:
                            if u'\u5b57\u6570' in content:
                               word = re.sub("[^0-9]", "", content)
                               words.append(word)
        except: pass

        return words

    except:       
        print 'Book ' + str(bookId) + ' does not exist'

Here is a sample run:

words = scrapeBook(3501537)
print words
>> [u'2532', u'2486', u'2510', u'2223', u'2349', u'2169', u'2259', u'2194', u'2151', u'2422', u'2159', u'2217', u'2158', u'2134', u'2098', u'2139', u'2216', u'2282', u'2298', u'2124', u'2242', u'2224', u'178', u'2168', u'2334', u'2132', u'2176', u'2271', u'2237']

Needless to say, the code is very slow. A major reason is that I need to open the URL for each book, and for each book I need to open the URL of every chapter. Is there a way to make this process faster?

Here is another bookId, 3022409, that is not empty. It has several hundred chapters, and the code runs forever.

1 Answer:

Answer 0 (score: 1)

The fact that you need to open every book and every chapter is dictated by the views exposed on the server. What you can do is implement a parallel client: create a thread pool and offload the HTTP requests to the workers as jobs, or do something similar with coroutines, as sketched below.
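
For illustration, here is a minimal sketch of the thread-pool variant, assuming the chapter links have already been collected from the book page. The helper names (fetch_chapter_words, scrape_chapters_parallel) and the pool size of 10 are illustrative, not part of the original code:

import re
import urllib2
from multiprocessing.dummy import Pool   # thread pool with the multiprocessing API
from bs4 import BeautifulSoup

def fetch_chapter_words(link):
    # Download one chapter page and pull out the word-count spans,
    # mirroring the span-parsing logic from the question.
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    words = []
    for span in soup.find_all('span'):
        content = span.string
        if content is not None and u'\u5b57\u6570' in content:
            words.append(re.sub("[^0-9]", "", content))
    return words

def scrape_chapters_parallel(links, workers=10):
    # Fan the chapter URLs out to a pool of worker threads; each worker
    # handles one chapter at a time, so several pages download in parallel.
    pool = Pool(workers)
    try:
        results = pool.map(fetch_chapter_words, links)
    finally:
        pool.close()
        pool.join()
    # Flatten the per-chapter lists into a single list of counts.
    return [word for chapter_words in results for word in chapter_words]

The book page itself still needs one request to gather the chapter links, but the per-chapter requests, which dominate the runtime, now overlap.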

Then there is the choice of an HTTP client library. I have found libcurl and geventhttpclient to be more efficient than urllib or any other Python standard library.
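
As a rough sketch of the coroutine route, gevent's monkey patching lets the existing urllib2 calls yield to each other while they wait on the network. This uses plain gevent rather than geventhttpclient, and the fetch helper and concurrency limit of 20 are assumptions for illustration:

from gevent import monkey
monkey.patch_all()        # make the standard socket module cooperative

import urllib2
from gevent.pool import Pool

def fetch(link):
    # Each greenlet downloads one chapter page; while it waits on the
    # network, other greenlets in the pool get to run.
    return urllib2.urlopen(link).read()

def fetch_all(links, concurrency=20):
    pool = Pool(concurrency)        # cap the number of in-flight requests
    return pool.map(fetch, links)   # pages come back in the order of links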