I have the following code to collect the number of words in each chapter of a book. In short, it opens the URL of each book and then opens the URL of every chapter belonging to that book.
import urllib2
from bs4 import BeautifulSoup
import re
def scrapeBook(bookId):
    url = 'http://www.qidian.com/BookReader/' + str(bookId) + '.aspx'
    try:
        words = []
        html = urllib2.urlopen(url,'html').read()
        soup = BeautifulSoup(html)
        try:
            chapters = soup.find_all('a', rel='nofollow')  # find all relevant chapters
            for chapter in chapters:                       # loop through chapters
                if 'title' in chapter.attrs:
                    link = chapter['href']                 # go to chapter to find words
                    htmlTemp = urllib2.urlopen(link,'html').read()
                    soupTemp = BeautifulSoup(htmlTemp)
                    # find out how many words there are in each chapter
                    spans = soupTemp.find_all('span')
                    for span in spans:
                        content = span.string
                        if not content == None:
                            if u'\u5b57\u6570' in content:
                                word = re.sub("[^0-9]", "", content)
                                words.append(word)
        except: pass
        return words
    except:
        print 'Book ' + str(bookId) + ' does not exist'
Here is a sample run:
words = scrapeBook(3501537)
print words
>> [u'2532', u'2486', u'2510', u'2223', u'2349', u'2169', u'2259', u'2194', u'2151', u'2422', u'2159', u'2217', u'2158', u'2134', u'2098', u'2139', u'2216', u'2282', u'2298', u'2124', u'2242', u'2224', u'178', u'2168', u'2334', u'2132', u'2176', u'2271', u'2237']
The code is undeniably very slow. A major reason is that I have to open a URL for every book, and for every book I have to open the URL of every chapter. Is there a way to make this process faster?
Here is another non-empty bookId, 3022409. It has hundreds of chapters, and the code runs forever.
Answer 0 (score: 1)
The fact that you need to open every book and every chapter is dictated by the views exposed on the server. What you can do is implement a parallel client: create a thread pool that offloads the HTTP requests as jobs to worker threads, or do something similar with coroutines.
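For example, here is a minimal sketch of the thread-pool idea in Python 2, using multiprocessing.dummy (a thread-backed Pool from the standard library) together with the same urllib2/BeautifulSoup stack as the question. The names fetchChapter and scrapeBookParallel and the pool size of 8 are illustrative choices, not part of the original code:

import re
import urllib2
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing.Pool API
from bs4 import BeautifulSoup

def fetchChapter(link):
    # Download one chapter page and return its word count, or None if not found.
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    for span in soup.find_all('span'):
        content = span.string
        if content is not None and u'\u5b57\u6570' in content:
            return re.sub("[^0-9]", "", content)
    return None

def scrapeBookParallel(bookId, workers=8):
    # Fetch the book page once, then fetch all chapter pages concurrently.
    url = 'http://www.qidian.com/BookReader/' + str(bookId) + '.aspx'
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    links = [a['href'] for a in soup.find_all('a', rel='nofollow') if 'title' in a.attrs]
    pool = Pool(workers)                   # 'workers' threads issue requests in parallel
    words = pool.map(fetchChapter, links)  # one job per chapter URL
    pool.close()
    pool.join()
    return [w for w in words if w is not None]

Because the work is almost entirely network I/O, threads give a useful speed-up despite the GIL, up to whatever request rate the server tolerates.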
Then there is the choice of HTTP client library. I have found libcurl and geventhttpclient to be more efficient than urllib or any other Python standard library.
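As a coroutine variant, here is a minimal sketch using gevent. It assumes the scrapeBook function from the question is already defined, keeps the standard-library client, and relies on monkey-patching to make the blocking sockets cooperative (geventhttpclient could be swapped in as the client). The book ids and the pool size of 20 are illustrative:

from gevent import monkey
monkey.patch_all()             # patch sockets before any network code runs

import gevent
from gevent.pool import Pool

bookIds = [3501537, 3022409]   # ids taken from the question, for illustration

pool = Pool(20)                # cap the number of concurrent requests
jobs = [pool.spawn(scrapeBook, bookId) for bookId in bookIds]
gevent.joinall(jobs)           # wait for all greenlets to finish

for bookId, job in zip(bookIds, jobs):
    print bookId, job.value    # job.value is whatever scrapeBook returned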