Question

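I am using urllib2 and BeautifulSoup in Python for web scraping, and I continuously save the scraped content to a file. I noticed that my progress gets slower and slower and eventually stops within 4 to 8 hours, even for something as simple as this: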

import urllib2
from bs4 import BeautifulSoup

def searchBook():
    fb = open(r'filePath', 'a')
    for index in range(3510000,3520000):
        url = 'http://www.qidian.com/Book/' + str(index) + '.aspx'
        try:
            html = urllib2.urlopen(url,'html').read()
            soup = BeautifulSoup(html)
            stats = getBookStats(soup)
            fb.write(str(stats))
            fb.write('\n')                
        except:
            print url + " doesn't exist"
    fb.close()


def getBookStats(soup):                                         # extract book info from script
    stats = {}
    stats['trialStatus'] = soup.find_all('span',{'itemprop':'trialStatus'})[0].string
    stats['totalClick'] = soup.find_all('span',{'itemprop':'totalClick'})[0].string
    stats['monthlyClick'] = soup.find_all('span',{'itemprop':'monthlyClick'})[0].string
    stats['weeklyClick'] = soup.find_all('span',{'itemprop':'weeklyClick'})[0].string
    stats['genre'] = soup.find_all('span',{'itemprop':'genre'})[0].string
    stats['totalRecommend'] = soup.find_all('span',{'itemprop':'totalRecommend'})[0].string
    stats['monthlyRecommend'] = soup.find_all('span',{'itemprop':'monthlyRecommend'})[0].string
    stats['weeklyRecommend'] = soup.find_all('span',{'itemprop':'weeklyRecommend'})[0].string
    stats['updataStatus'] = soup.find_all('span',{'itemprop':'updataStatus'})[0].string
    stats['wordCount'] = soup.find_all('span',{'itemprop':'wordCount'})[0].string
    stats['dateModified'] = soup.find_all('span',{'itemprop':'dateModified'})[0].string
    return stats

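My questions are:

1) What is the bottleneck of this code: urllib2.urlopen() or soup.find_all()?

2) The only way I can tell that the code has stopped is by checking the output file. I then manually restart it from where it stopped. Is there a more efficient way to tell that the code has stopped? Is there a way to restart it automatically?

3) Of course, the best outcome would be to keep the code from slowing down and stopping at all. What are the possible places I could look?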

I am currently trying the suggestions from the answers and comments:

1) @DavidEhrmann

url = 'http://www.qidian.com/BookReader/' + str(3532901) + '.aspx'
with urllib2.urlopen(url,'html') as u: html = u.read()
# html = urllib2.urlopen(url,'html').read()
--------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-32-8b6f635f6bd5> in <module>()
      1 url = 'http://www.qidian.com/BookReader/' + str(3532901) + '.aspx'
----> 2 with urllib2.urlopen(url,'html') as u: html = u.read()
      3 html = urllib2.urlopen(url,'html').read()
      4 soup = BeautifulSoup(html)

AttributeError: addinfourl instance has no attribute '__exit__'

2) @Stardustone

The program still stops after adding sleep() calls at various locations.

Answer 1

I suspect the system load average is too high. Try adding a time.sleep(0.5) inside the try block on each iteration:

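import time   # at the top of the script

        try:
            html = urllib2.urlopen(url,'html').read()
            soup = BeautifulSoup(html)
            stats = getBookStats(soup)
            fb.write(str(stats))
            fb.write('\n')
            time.sleep(0.5)          # pause half a second between requests
        except:
            print url + " doesn't exist"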

Answer 2

See this answer for how to time a function call. That way you can determine whether it is urlopen() that is getting slower.
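One way to take such measurements is sketched below. The helper name `timed` is hypothetical and not taken from the linked answer; it wraps any callable and reports how long the call took, so the fetch and parse steps can be timed separately.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.time()
    result = func(*args, **kwargs)
    return result, time.time() - start

# Hypothetical use inside the scraping loop from the question:
#   html, fetch_secs = timed(lambda: urllib2.urlopen(url, 'html').read())
#   soup, parse_secs = timed(BeautifulSoup, html)
#   stats, find_secs = timed(getBookStats, soup)
# Writing fetch_secs, parse_secs and find_secs to a log alongside the index
# shows which of the three steps grows as the run gets slower.
```

If the fetch time climbs while the parse time stays flat, the slowdown is on the network or server side rather than in BeautifulSoup.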

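As @halfer said, it is likely that the website you are scraping does not want you to scrape it heavily and is progressively throttling your requests. Check their terms of service, and check whether they offer an API as an alternative to scraping.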
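If throttling is indeed the cause, one common response is to wait longer after each failed request instead of retrying immediately. The following is only a sketch under that assumption: `fetch_with_backoff` and its parameters are hypothetical names, and `fetch` stands for whatever function downloads a page.

```python
import time

def fetch_with_backoff(fetch, url, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); after each failure wait base_delay, 2*base_delay, ... and retry."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # exponential backoff

# Hypothetical use with urllib2; the timeout keeps a stalled connection
# from blocking the loop indefinitely:
#   fetch = lambda u: urllib2.urlopen(u, timeout=30).read()
#   html = fetch_with_backoff(fetch, url)
```

Note that this sketch retries every error, including a page that genuinely does not exist, so a real crawler would probably want to treat a 404 differently from a timeout.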

What are possible reasons for web scraping to gradually slow down and eventually stop?
