I'm trying to scrape some URLs with BeautifulSoup. The URLs I'm scraping come from a Google Analytics API call, and some of them don't work, so I need a way to skip them.
I tried adding this:
except urllib2.HTTPError:
    continue
But I get the following syntax error:
    except urllib2.HTTPError:
         ^
SyntaxError: invalid syntax
Here is my full code:
rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.konbini.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

    urllist = [mystring + x for x in rawdata]

    for row in urllist:
        # query the website and return the html to the variable 'page'
        page = urllib2.urlopen(row)
        except urllib2.HTTPError:
        continue

        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()  # strip() is used to remove starting and trailing
        # save the data in tuple
        sharelist.append((row, share))
    print(sharelist)
Answer 0 (score: 2)
Your except statement isn't preceded by a try statement. You should use the following pattern:
try:
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue
Also watch your indentation levels: the code executed under the try clause must be indented, and so must the body of the except clause.
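For example, a minimal sketch of how the try/except fits inside the question's existing loop (all names are taken from the question's code):

for row in urllist:
    try:
        # the call that may raise is indented one level under try:
        page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        # the handler body is indented one level under except:
        continue
    # execution continues here only for URLs that opened successfully
    soup = BeautifulSoup(page, 'html.parser')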
Answer 1 (score: 2)
Two errors:

1. No try statement
2. No indentation
Use this:
for row in urllist:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue
Answer 2 (score: 1)
If you only want to catch 404s, you need to check the returned code or re-raise the error; otherwise you will catch and ignore more than just 404:
import urllib2
from bs4 import BeautifulSoup
from urlparse import urljoin

def print_results(results):
    base = 'http://www.konbini.com'
    rawdata = []
    sharelist = []
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

    # use urljoin to join to the base url
    urllist = [urljoin(base, h) for h in rawdata]

    for url in urllist:
        # query the website and return the html to the variable 'page'
        try:  # need to open with try
            page = urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            if e.getcode() == 404:  # check the return code
                continue
            raise  # if other than 404, raise the error

        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()  # strip() is used to remove starting and trailing
        # save the data in tuple
        sharelist.append((url, share))
    print(sharelist)
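Note that urllib2 can also raise urllib2.URLError for failures that never reach the HTTP layer (for example, a bad hostname). If you want to skip those URLs as well, a minimal sketch, assuming you still want to re-raise non-404 HTTP errors (HTTPError is a subclass of URLError, so it must be caught first):

try:
    page = urllib2.urlopen(url)
except urllib2.HTTPError as e:
    if e.getcode() == 404:
        continue
    raise
except urllib2.URLError:
    # DNS failures, refused connections, etc. -- skip these too
    continue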
Answer 3 (score: 0)
Your syntax error comes from the fact that you are using except without a preceding try statement.
try:
    # code that might throw HTTPError
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue
Answer 4 (score: 0)
As others have already mentioned, you should use an IDE or editor so you don't run into problems like this; there are several good IDEs and editors to choose from. In any case, here is your code with the try added and the indentation fixed:
rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.konbini.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

    urllist = [mystring + x for x in rawdata]

    for row in urllist:
        # query the website and return the html to the variable 'page'
        try:
            page = urllib2.urlopen(row)
        except urllib2.HTTPError:
            continue

        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()  # strip() is used to remove starting and trailing
        # save the data in tuple
        sharelist.append((row, share))
    print(sharelist)
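One last note: urllib2 exists only on Python 2 (as does the print statement used above). If you ever move to Python 3, the same skip-on-error loop would use urllib.request and urllib.error instead; a minimal sketch of just the loop, assuming urllist is built as above:

import urllib.request
import urllib.error
from bs4 import BeautifulSoup

for row in urllist:
    try:
        page = urllib.request.urlopen(row)
    except urllib.error.HTTPError:
        # skip URLs that return an HTTP error status
        continue
    soup = BeautifulSoup(page, 'html.parser')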