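I am trying to use urllib2 to go through a series of numbered data pages. What I want to do is use a try statement, but I know little about it. Judging by a bit of reading, it seems to be based on specific "names" that are exceptions, e.g. IOError. I don't know which error I am looking for, which is part of the problem.
I've written/pasted my urllib2 page-fetching routine from "urllib2 - The Missing Manual", thus: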
import cookielib
import os
import sys
import urllib2

def fetch_page(url, useragent):
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()

    txheaders = {'User-agent': useragent}

    if os.path.isfile(COOKIEFILE):
        cj.load(COOKIEFILE)
        print "previous cookie loaded..."
    else:
        print "no ospath to cookfile"

    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
    try:
        req = urllib2.Request(url, useragent)
        # create a request object
        handle = urlopen(req)
        # and open it to return a handle on the url
    except IOError, e:
        print 'Failed to open "%s".' % url
        if hasattr(e, 'code'):
            print 'We failed with error code - %s.' % e.code
        elif hasattr(e, 'reason'):
            print "The error object has the following 'reason' attribute :"
            print e.reason
            print "This usually means the server doesn't exist,",
            print "is down, or we don't have an internet connection."
        return False
    else:
        print
        if cj is None:
            print "We don't have a cookie library available - sorry."
            print "I can't show you any cookies."
        else:
            print 'These are the cookies we have received so far :'
            for index, cookie in enumerate(cj):
                print index, ' : ', cookie
            cj.save(COOKIEFILE)  # save the cookies again

        page = handle.read()
        return (page)

def fetch_series():
    useragent = "Firefox...etc."
    url = "www.example.com/01.html"
    try:
        fetch_page(url, useragent)
    except [something]:
        print "failed to get page"
        sys.exit()
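The bottom function is just an example to show what I mean. Can anyone tell me what I should put there? I made the page-fetching function return False if it gets a 404; is that correct? And why doesn't "except False:" work? Thanks for any help you can give.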
OK, following the advice given here, I have tried:
except urlib2.URLError, e:
except URLError, e:
except URLError:
except urllib2.IOError, e:
except IOError, e:
except IOError:
except urllib2.HTTPError, e:
except urllib2.HTTPError:
except HTTPError:
None of them work.
Answer 0 (score: 36)
You should catch urllib2.HTTPError if you want to detect a 404:
try:
    # create a request object; the user agent goes in a header
    # (the second positional argument of Request is POST data)
    req = urllib2.Request(url, headers={'User-agent': useragent})
    handle = urllib2.urlopen(req)
    # and open it to return a handle on the url
except urllib2.HTTPError, e:
    print 'We failed with error code - %s.' % e.code

    if e.code == 404:
        pass  # do stuff..
    else:
        pass  # other stuff...

    return False
else:
    pass  # ...
To catch it in fetch_series():
def fetch_page(url, useragent):
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()
    try:
        urlopen()
        #...
    except IOError, e:
        # ...
        raise  # re-raise, otherwise fetch_series() never sees the error
    else:
        #...
        pass

def fetch_series():
    useragent = "Firefox...etc."
    url = "www.example.com/01.html"
    try:
        fetch_page(url, useragent)
    except urllib2.HTTPError, e:
        print "failed to get page"
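As a rough Python 3 sketch of the same idea (urllib2 was split into urllib.request and urllib.error there): fetch_page lets the exception propagate, and fetch_series catches urllib.error.HTTPError and looks at e.code. The opener parameter, fake_opener and the list of numbered URLs are assumptions added for illustration so the snippet runs without a network connection; they are not part of the original code.

```python
import io
import urllib.error
import urllib.request


def fetch_page(url, useragent, opener=urllib.request.urlopen):
    """Return the page body; HTTPError/URLError propagate to the caller.

    `opener` is a hypothetical hook (not in the original code) so the
    function can be exercised without network access.
    """
    req = urllib.request.Request(url, headers={"User-Agent": useragent})
    handle = opener(req)
    try:
        return handle.read()
    finally:
        handle.close()


def fetch_series(urls, useragent, opener=urllib.request.urlopen):
    """Fetch pages in order and stop at the first 404."""
    pages = []
    for url in urls:
        try:
            pages.append(fetch_page(url, useragent, opener))
        except urllib.error.HTTPError as e:
            if e.code == 404:
                break  # ran past the last numbered page
            raise      # any other HTTP status is unexpected here
    return pages


def fake_opener(req):
    """Stand-in for urlopen: page 01 exists, everything else is a 404."""
    url = req.get_full_url()
    if url.endswith("/01.html"):
        return io.BytesIO(b"page one")
    raise urllib.error.HTTPError(url, 404, "Not Found", None, io.BytesIO(b""))


urls = ["http://www.example.com/%02d.html" % n for n in range(1, 4)]
pages = fetch_series(urls, "Firefox...etc.", opener=fake_opener)
print(pages)
```

A return value such as False is not an exception, so it can only be tested with if; an except clause never sees it. That is why "except False:" has no effect.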
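From http://docs.python.org/library/urllib2.html:
exception urllib2.HTTPError
Though being an exception (a subclass of URLError), an HTTPError can also function as a non-exceptional file-like return value (the same thing that urlopen() returns). This is useful when handling exotic HTTP errors, such as requests for authentication.
code
An HTTP status code as defined in RFC 2616. This numeric value corresponds to a value found in the dictionary of codes as found in BaseHTTPServer.BaseHTTPRequestHandler.responses.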
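The points in that documentation excerpt can be checked without touching the network. The sketch below uses the Python 3 module names (urllib.error and http.server instead of urllib2 and BaseHTTPServer) and builds an HTTPError by hand; the URL and the response body are made up for illustration.

```python
import io
import urllib.error
from http.server import BaseHTTPRequestHandler

# HTTPError -> URLError -> OSError (IOError is an alias of OSError in Python 3).
print(urllib.error.HTTPError.__mro__[:2])

# Build an HTTPError by hand (no network); the body here is invented.
err = urllib.error.HTTPError(
    "http://www.example.com/01.html", 404, "Not Found", None,
    io.BytesIO(b"<h1>missing</h1>"),
)

try:
    raise err
except urllib.error.URLError as e:  # the base class catches it too
    caught = e

status = caught.code                                  # numeric HTTP status
phrase = BaseHTTPRequestHandler.responses[status][0]  # short reason phrase
body = caught.read()                                  # file-like: error page body
print(status, phrase, body)
```

Because HTTPError is a subclass of URLError, which in turn is an IOError, an "except IOError" or "except URLError" clause placed earlier will catch a 404 before a later "except HTTPError" clause gets the chance.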
Answer 1 (score: 8)
I recommend checking out the wonderful requests module.
With it you can achieve the functionality you are asking for like this:
import requests
from requests.exceptions import HTTPError

try:
    r = requests.get('http://httpbin.org/status/200')
    r.raise_for_status()
except HTTPError:
    print 'Could not download page'
else:
    print r.url, 'downloaded successfully'

try:
    r = requests.get('http://httpbin.org/status/404')
    r.raise_for_status()
except HTTPError:
    print 'Could not download', r.url
else:
    print r.url, 'downloaded successfully'
Answer 2 (score: 2)
To find out about the nature and possible contents of such an exception in Python, just try the key calls interactively:
>>> f = urllib2.urlopen('http://httpbin.org/status/404')
Traceback (most recent call last):
  ...
  File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: NOT FOUND
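sys.last_value then contains the exception value that fell through to the interactive prompt, and you can play with it
(use TAB + "." auto-expansion in your interactive shell, dir(), vars() ...):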
>>> ev = sys.last_value
>>> ev.__class__
<class 'urllib2.HTTPError'>
>>> dir(ev)
['_HTTPError__super_init', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__getslice__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', 'args', 'close', 'code', 'errno', 'filename', 'fileno', 'fp', 'getcode', 'geturl', 'hdrs', 'headers', 'info', 'message', 'msg', 'next', 'read', 'readline', 'readlines', 'reason', 'strerror', 'url']
>>> vars(ev)
{'fp': <addinfourl at 140193880 whose fp = <socket._fileobject object at 0x01062370>>, 'fileno': <bound method _fileobject.fileno of <socket._fileobject object at 0x01062370>>, 'code': 404, 'hdrs': <httplib.HTTPMessage instance at 0x085ADF80>, 'read': <bound method _fileobject.read of <socket._fileobject object at 0x01062370>>, 'readlines': <bound method _fileobject.readlines of <socket._fileobject object at 0x01062370>>, 'next': <bound method _fileobject.next of <socket._fileobject object at 0x01062370>>, 'headers': <httplib.HTTPMessage instance at 0x085ADF80>, '__iter__': <bound method _fileobject.__iter__ of <socket._fileobject object at 0x01062370>>, 'url': 'http://httpbin.org/status/404', 'msg': 'NOT FOUND', 'readline': <bound method _fileobject.readline of <socket._fileobject object at 0x01062370>>}
>>> sys.last_value.code
404
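sys.last_value is only set for exceptions that reach the interactive prompt, so inside a script the same inspection has to be done on the caught object. A minimal Python 3 sketch, assuming a hand-built HTTPError in place of a real request; describe() is a hypothetical helper, not part of any library:

```python
import io
import urllib.error


def describe(exc):
    """Return the class chain and the instance attribute names of an exception."""
    chain = [cls.__name__ for cls in type(exc).__mro__]
    fields = sorted(vars(exc))
    return chain, fields


try:
    # Raised by hand so the example does not need network access.
    raise urllib.error.HTTPError(
        "http://httpbin.org/status/404", 404, "NOT FOUND", None, io.BytesIO(b"")
    )
except Exception as exc:
    chain, fields = describe(exc)
    code = exc.code

print(chain)   # class hierarchy, most specific first
print(fields)  # instance attributes such as 'code' and 'msg'
print(code)
```

The class chain shows which names an except clause can use, and the field list shows what the handler can read from the exception.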
Try handling it:
>>> try: f = urllib2.urlopen('http://httpbin.org/status/404')
... except urllib2.HTTPError, ev:
...     print ev, "'s error code is", ev.code
...
HTTP Error 404: NOT FOUND 's error code is 404
>>> ho = urllib2.OpenerDirector()
>>> ho.add_handler(urllib2.HTTPHandler())
>>> f = ho.open('http://localhost:8080/cgi/somescript.py'); f
<addinfourl at 138851272 whose fp = <socket._fileobject object at 0x01062370>>
>>> f.code
500
>>> f.read()
'Execution error: <pre style="background-color:#faa">\nNameError: name \'e\' is not defined\n<pre>\n'
The default handlers of urllib2.build_opener:
default_classes = [ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor]
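The difference between the two openers can be seen by listing the handlers each one actually installed. This is a sketch using the Python 3 name urllib.request in place of urllib2; handler_names() is a hypothetical helper, and the exact default list varies with the Python version and proxy settings.

```python
import urllib.request


def handler_names(opener):
    """Class names of the handlers registered on an OpenerDirector."""
    return sorted(type(h).__name__ for h in opener.handlers)


# Opener with the default handler chain.
default_opener = urllib.request.build_opener()

# Bare opener with only an HTTPHandler, as in the session above.
bare_opener = urllib.request.OpenerDirector()
bare_opener.add_handler(urllib.request.HTTPHandler())

print(handler_names(default_opener))
print(handler_names(bare_opener))
```

HTTPErrorProcessor and HTTPDefaultErrorHandler are the handlers that turn a non-2xx response into a raised HTTPError. The bare opener has neither, which is why the 500 response above came back as an ordinary file-like object with f.code set to 500.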