Question

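I'm currently working through Web Scraping with Python by Ryan Mitchell. In the first chapter, when he discusses handling errors, he says:

If the server is not found at all (for example, the site is down, or the URL is mistyped), urlopen returns a None object.

To test this, I wrote the following snippet.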

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup as bs

def getTitle(url):

    try:
        html = urlopen(url).read()
    except HTTPError:
        return None

    try:
        bsObj = bs(html)
    except AttributeError:
        return None
    return bsObj

title = getTitle('http://www.wunderlst.com')
print(title)

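On the second-to-last line of this code, I deliberately mistyped the URL (the actual URL is http://www.wunderlist.com). I expected None to be printed on the screen. Instead, I get a long list of errors. Here is the last part of the error message: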

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ex4.py", line 18, in <module>
    title = getTitle('http://www.wunderlst.com')
  File "ex4.py", line 8, in getTitle
    html = urlopen(url).read()
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 463, in open
    response = self._open(req, data)
  File "/usr/lib/python3.4/urllib/request.py", line 481, in _open
    '_open', req)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.4/urllib/request.py", line 1184, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>

Now, if I correct the hostname but append a page that does not exist, for example:

title = getTitle('http://www.wunderlist.com/something')

then None is printed on the screen. I'm really confused. Can anyone explain what is actually happening? Thanks in advance.

Answer 1

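I think the problem is that you only catch HTTPError (and return None). Try catching the URLError exception as well.

Replace
from urllib.error import HTTPError
with
from urllib.error import HTTPError, URLError

Replace
except HTTPError:
with
except (HTTPError, URLError):

This gives you the behavior you want (None is returned in both cases). However, I would suggest handling these errors separately (move the first try block into its own function, stop scraping on an error, and so on).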
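As a minimal sketch of what handling the two errors separately might look like: the function name fetch_html and its opener parameter are assumptions made for illustration (they are not in the question's code), and the opener hook exists only so the function can be tried without network access.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_html(url, opener=urlopen):
    """Return the response body as bytes, or None if the request fails.

    `opener` is a hypothetical hook (it defaults to urlopen) so that the
    function can be exercised without touching the network.
    """
    try:
        return opener(url).read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be listed first.
        print("HTTP error {} for {}".format(err.code, url))
        return None
    except URLError as err:
        # The server could not be reached at all (DNS failure, no route, ...).
        print("Could not reach {}: {}".format(url, err.reason))
        return None
```

Called as fetch_html('http://www.wunderlst.com'), this should take the URLError branch; a wrong path on an existing host should take the HTTPError branch.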

Answer 2

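The book/article you are referring to is either wrong or outdated. In the urllib documentation, you can read

If the connection cannot be made, the IOError exception is raised.

If the hostname cannot be resolved, the connection obviously cannot be made, so according to the documentation an IOError has to be raised. URLError is a subclass of IOError in older Pythons, though from a cursory glance newer versions of urllib do not seem to have a urlopen function.

As mentioned in the comments, I was looking at the wrong library (urllib instead of urllib.request); you will find a similar statement there, though:

Raises URLError on errors.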

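A wrong path, by contrast, makes the server reply with an HTTP error such as 404, which urlopen reports as HTTPError. Your code catches that one, which is why a bad path gives None, while an unresolvable hostname raises a URLError that nothing catches.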
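The relationship between these exception classes can be checked directly in a Python 3 interpreter. This is only a quick sanity check using the standard library, not something taken from the book:

```python
from urllib.error import HTTPError, URLError

# HTTPError is a specialised URLError, so `except URLError` also catches it.
print(issubclass(HTTPError, URLError))

# URLError derives from OSError; since Python 3.3, IOError is an alias of OSError.
print(issubclass(URLError, OSError))
print(IOError is OSError)
```

On Python 3.3 or later, all three lines print True.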

Answer 3

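URLError is usually raised because there is no network connection (no route to the specified server), or because the specified server does not exist.

'http://www.wunderlst.com' does not exist, which is why the error is raised.

See the following link for more details.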

https://docs.python.org/3.1/howto/urllib2.html#handling-exceptions

urlopen does not return a None object when the URL is mistyped