I'm currently learning to write a scraper with BeautifulSoup. So far my code works, apart from a few problems. For context, I'm scraping player data from the Fold.it project. Since multiple pages need to be scraped, I've been using this block at the end of the loop to find the next page:
next_link = soup.find(class_='active', title='Go to next page')
url_next = "http://www.fold.it" + next_link['href'] ### problem line???
print url_next
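The error comes from the fact that `soup.find()` returns `None` when nothing matches, and subscripting `None` with `['href']` raises a `TypeError`. A minimal standalone sketch (the original script is Python 2, but the failure mode is identical in Python 3; the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<a class="active" title="Go to next page" href="/portal/players/s_all?page=2">next</a>'
soup = BeautifulSoup(html, "html.parser")

# A matching tag is found: subscripting it works.
link = soup.find(class_='active', title='Go to next page')
print(link['href'])  # /portal/players/s_all?page=2

# No matching tag: find() returns None, and None['href'] raises TypeError.
missing = soup.find(class_='active', title='Go to previous page')
print(missing is None)  # True
```

So if the page was fetched but the next-page link didn't parse (truncated response, changed markup), `next_link` is `None` and the very next line blows up.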
Unfortunately, I sometimes get a result like this:
From what I can tell, the next-page link sometimes fails to parse, for whatever reason. I'm not sure whether it's this particular site, the code I wrote, or something else entirely. So far I've tried adding a check for whether it returns `NoneType`, but it still errors out.
The ideal behavior I'm looking for is to scrape through to the last page, but if an error does occur, to retry the same page. Any ideas, input, or obvious mistakes I've made would be greatly appreciated!
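One way to get that "retry the same page" behavior is to wrap the fetch in a small helper that retries a bounded number of times before giving up. This is a sketch, not part of the original script; `fetch_with_retry`, `max_retries`, and `delay` are illustrative names:

```python
import time

def fetch_with_retry(fetch, url, max_retries=3, delay=1.0):
    """Call fetch(url), retrying the same URL on any exception.

    Gives up after max_retries attempts so a permanently broken
    page cannot retry forever.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            print("attempt %d failed for %s: %s" % (attempt, url, exc))
            if attempt < max_retries:
                time.sleep(delay)  # brief pause before retrying
    raise RuntimeError("giving up on %s after %d attempts" % (url, max_retries))
```

In the scraper, `fetch` would be whatever does the `urlopen` + parse step; the cap is what distinguishes "retry the same page" from an infinite loop.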
The full code is below:
import os
import urllib2
import csv
import time
from bs4 import BeautifulSoup
url_next = 'http://www.fold.it/portal/players/s_all'
url_last = ''
today_string = time.strftime('%m_%d_%Y')
location = '/home/' + 'daily_soloist_' + today_string + '.csv'
mode = 'a' if os.path.exists(location) else 'w'
with open(location, mode) as my_csv:
    database = csv.writer(my_csv, delimiter=',')
    while True:
        soup = BeautifulSoup(urllib2.urlopen(url_next).read(), "lxml")
        if url_next == url_last:
            print "Scraping Complete"
            break
        for row in soup('tr', {'class': 'even'}):
            cells = row('td')
            # current rank
            rank = cells[0].text
            # finds first text node - user name
            name = cells[1].a.find(text=True).strip()
            # separates ranking
            rank1, rank2 = cells[1].find_all("span")
            # total global score
            score = cells[2].string
            data = [[int(str(rank[1:])), name.encode('ascii', 'ignore'),
                     int(str(rank1.text)), int(str(rank2.text)), int(str(score))]]
            # writes to csv
            database.writerows(data)
        next_link = soup.find(class_='active', title='Go to next page')
        url_next = "http://www.fold.it" + next_link['href']  ### problem line???
        print url_next
        last_link = soup.find(class_='active', title='Go to last page')
        url_last = "http://www.fold.it" + last_link['href']
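An alternative stopping condition worth considering (an assumption about the site, not something from the question): if the last page simply has no "Go to next page" link, `find()` returning `None` can itself signal completion, which avoids both the `None['href']` crash and the `url_last` comparison. A standalone sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

# Simulate a last page: it has a "last page" link but no "next page" link.
last_page_html = '<a class="active" title="Go to last page" href="/x?page=9">last</a>'
soup = BeautifulSoup(last_page_html, "html.parser")

next_link = soup.find(class_='active', title='Go to next page')
if next_link is None:
    url_next = None  # no next page: scraping is complete
else:
    url_next = "http://www.fold.it" + next_link['href']

print(url_next)  # None
```

The trade-off is that this treats "link failed to parse" the same as "last page reached", so it only works if the parse is otherwise reliable.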
Answer 0 (score: 0)
As a fix, you can wrap the block in the `try:` / `except:` below. (You should add more error handling than I have.) If the `try` fails, you never change the `url_next` value, so the same page is retried. But be careful: if the same page errors every time, you'll be stuck in an infinite loop.
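That infinite-loop caveat can be addressed by counting consecutive failures and re-raising once a cap is hit. This sketch abstracts the per-page work into a `fetch_page` callable; `MAX_RETRIES`, `scrape_all`, and `fetch_page` are illustrative names, not part of the original answer:

```python
MAX_RETRIES = 3  # illustrative cap on consecutive failures for one page

def scrape_all(fetch_page, first_url):
    """Walk pages until fetch_page returns None as the next URL.

    fetch_page(url) returns (data, next_url) or raises on failure.
    On failure the same url is retried, up to MAX_RETRIES times in a row.
    """
    url, retries, results = first_url, 0, []
    while url is not None:
        try:
            data, next_url = fetch_page(url)
        except Exception:
            retries += 1
            if retries >= MAX_RETRIES:
                raise  # same page kept failing; give up instead of spinning
            continue   # retry the same url, as in the answer above
        retries = 0    # reset the counter after a successful page
        results.append(data)
        url = next_url
    return results
```

This keeps the answer's "don't advance `url_next` on failure" idea while guaranteeing termination on a permanently broken page.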
try:
    if url_next == url_last:
        print "Scraping Complete"
        break
    for row in soup('tr', {'class': 'even'}):
        cells = row('td')
        # current rank
        rank = cells[0].text
        # finds first text node - user name
        name = cells[1].a.find(text=True).strip()
        # separates ranking
        rank1, rank2 = cells[1].find_all("span")
        # total global score
        score = cells[2].string
        data = [[int(str(rank[1:])), name.encode('ascii', 'ignore'),
                 int(str(rank1.text)), int(str(rank2.text)), int(str(score))]]
        # writes to csv
        database = csv.writer(my_csv, delimiter=',')
        database.writerows(data)
    next_link = soup.find(class_='active', title='Go to next page')
    url_next = "http://www.fold.it" + next_link['href']
except:  # if the above bombs out, maintain the same url_next
    print "problem with this page, try again"
print url_next