urllib2 URLError: reading server response codes (Python)

Asked: 2014-06-25 22:59:02

Tags: python urllib2 urllib http-status-codes

I have a list of URLs. I want to check each server's response code to see whether any of the links are broken. I can read server errors (500) and broken links (404), but the code breaks as soon as it hits a non-existent site (e.g. "notawebsite_broken.com"). I have searched around without finding an answer... I hope you can help.

Here is the code:

import urllib2

#List of URLs. The third URL is not a website
urls = ["http://www.google.com","http://www.ebay.com/broken-link",
"http://notawebsite_broken"]

#Empty list to store the output
response_codes = []

# Run "for" loop: get server response code and save results to response_codes
for url in urls:
    try:
        connection = urllib2.urlopen(url)
        response_codes.append(connection.getcode())
        connection.close()
        print url, ' - ', connection.getcode()
    except urllib2.HTTPError, e:
        response_codes.append(e.getcode())
        print url, ' - ', e.getcode()

print response_codes

This gives the output:
http://www.google.com  -  200
http://www.ebay.com/broken-link  -  404
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    connection = urllib2.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

Does anyone know of a fix for this, or can someone point me in the right direction?

3 Answers:

Answer 0 (score: 3):

You can use requests:

import requests

urls = ["http://www.google.com","http://www.ebay.com/broken-link",
"http://notawebsite_broken"]

for u in urls:
    try:
        r = requests.get(u)
        print "{} {}".format(u,r.status_code)
    except Exception,e:
        print "{} {}".format(u,e)

http://www.google.com 200
http://www.ebay.com/broken-link 404
http://notawebsite_broken HTTPConnectionPool(host='notawebsite_broken', port=80): Max retries exceeded with url: /

Answer 1 (score: 1):

When urllib2.urlopen() cannot connect to the server, or cannot resolve the host's IP address, it raises a URLError rather than an HTTPError. You need to catch urllib2.URLError in addition to urllib2.HTTPError to handle these cases.
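A sketch of what that looks like, written with the Python 3 equivalents (urllib.request / urllib.error; in Python 2 the same classes live in the urllib2 module). The helper name check_urls is my own, not from the question:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def check_urls(urls):
    """Map each URL to its HTTP status code, or an error message string."""
    results = {}
    for url in urls:
        try:
            conn = urlopen(url, timeout=5)
            results[url] = conn.getcode()
            conn.close()
        except HTTPError as e:
            # The server answered with an error status (404, 500, ...).
            results[url] = e.code
        except URLError as e:
            # No HTTP conversation happened at all: DNS failure,
            # refused connection, timeout, and so on.
            results[url] = "error: {}".format(e.reason)
    return results
```

The key point is the order of the except clauses: HTTPError is a subclass of URLError, so it must be caught first if you want to treat the two cases differently.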

Answer 2 (score: 1):

The urllib2 library's API is a nightmare.

Many people (myself included) strongly recommend that you use the requests package instead:

One of the nicer things about requests is that any request failure inherits from a single base exception class. When you use urllib2 "raw", exceptions can be raised from socket as well as from the urllib2 module, and possibly others (I can't remember exactly, but it's messy).
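As a sketch of that point (the helper name status_or_error is mine), a single except clause on requests' base exception class, requests.exceptions.RequestException, covers every failure mode:

```python
import requests

def status_or_error(url):
    """Return the HTTP status code, or the exception message on failure."""
    try:
        return requests.get(url, timeout=5).status_code
    except requests.exceptions.RequestException as e:
        # Every requests failure (DNS error, refused connection,
        # timeout, invalid URL, ...) derives from this one base class.
        return str(e)
```

Compare that to the urllib2 version, which needs separate handlers for HTTPError, URLError, and potentially raw socket errors.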

tl;dr - just use the requests library.