Checking whether a next page exists with BeautifulSoup

Asked: 2015-11-09 23:51:57

Tags: python web-scraping beautifulsoup

I'm currently learning to write a scraper with BeautifulSoup. So far my code works fine, apart from a few issues. For background, I'm scraping player data from the Fold.it project. Since multiple pages need to be scraped, I've been using this block of code at the end of the loop to find the next page:

   next_link = soup.find(class_='active', title='Go to next page')
   url_next = "http://www.fold.it" + next_link['href'] ### problem line???
   print url_next

Unfortunately, I sometimes get an error like this instead: [screenshot of the error output]

As far as I can tell, the next-page link is for some reason not being parsed. I'm not sure whether it's down to this particular site, the code I wrote, or something else entirely. So far I've tried writing code to check whether the lookup returns NoneType, but it still errors out.
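Roughly, the kind of check I mean (a minimal sketch, not my exact code):

    next_link = soup.find(class_='active', title='Go to next page')
    if next_link is None:
        print "next page link not found; retry this page"
    else:
        url_next = "http://www.fold.it" + next_link['href']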

The ideal behavior I'm after is to scrape through to the last page, but if an error does occur, to retry the same page. Any thoughts, input, or obvious mistakes I've made would be greatly appreciated!

Full code below:

import os
import urllib2
import csv
import time
from bs4 import BeautifulSoup

url_next = 'http://www.fold.it/portal/players/s_all'
url_last = ''

today_string = time.strftime('%m_%d_%Y')
location = '/home/' + 'daily_soloist_' + today_string + '.csv'

mode = 'a' if os.path.exists(location) else 'w'
with open(location, mode) as my_csv:
    while True:
        soup = BeautifulSoup(urllib2.urlopen(url_next).read(), "lxml")
        if url_next == url_last:
            print "Scraping Complete"
            break

        for row in soup('tr', {'class':'even'}):
            cells = row('td')

            #current rank
            rank = cells[0].text

            #finds first text node - user name
            name = cells[1].a.find(text=True).strip()

            #separates ranking
            rank1, rank2 = cells[1].find_all("span")

            #total global score
            score = row('td')[2].string

            data = [[int(str(rank[1:])), name.encode('ascii', 'ignore'), int(str(rank1.text)), int(str(rank2.text)), int(str(score))]]

            #writes to csv
            database = csv.writer(my_csv, delimiter=',')
            database.writerows(data)

        next_link = soup.find(class_='active', title='Go to next page')
        url_next = "http://www.fold.it" + next_link['href'] ### problem line???
        print url_next

        last_link = soup.find(class_='active', title='Go to last page')
        url_last = "http://www.fold.it" + last_link['href']

1 Answer:

Answer 0 (score: 0)

As a fix, you can wrap the loop body in the try/except block below. (You should add more error handling than I have.) If the try fails, you simply don't update the url_next value, so the same page is fetched again. Be careful, though: if the error keeps occurring on the same page, you'll be stuck in an infinite loop.

try:
    if url_next == url_last:
        print "Scraping Complete"
        break

    for row in soup('tr', {'class':'even'}):
        cells = row('td')

        #current rank
        rank = cells[0].text

        #finds first text node - user name
        name = cells[1].a.find(text=True).strip()

        #separates ranking
        rank1, rank2 = cells[1].find_all("span")

        #total global score
        score = row('td')[2].string

        data = [[int(str(rank[1:])), name.encode('ascii', 'ignore'), int(str(rank1.text)), int(str(rank2.text)), int(str(score))]]

        #writes to csv
        database = csv.writer(my_csv, delimiter=',')
        database.writerows(data)  


    next_link = soup.find(class_='active', title='Go to next page')
    url_next = "http://www.fold.it" + next_link['href'] ### problem line???

except:  #if the above bombs out, maintain the same url_next
    print "problem with this page, try again"

print url_next
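To guard against that infinite loop, one option is to cap the number of retries per page. A minimal sketch (the max_retries cap and the retry counter are illustrative additions, not part of the answer above):

max_retries = 3   # illustrative limit, not from the original answer
retries = 0
while True:
    try:
        soup = BeautifulSoup(urllib2.urlopen(url_next).read(), "lxml")
        # ... row processing as in the code above ...
        next_link = soup.find(class_='active', title='Go to next page')
        url_next = "http://www.fold.it" + next_link['href']
        retries = 0                      # page succeeded, reset the counter
    except Exception:
        retries += 1
        if retries >= max_retries:
            print "giving up on", url_next
            break
        print "problem with this page, try again"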