Question

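I have a list of URLs that I want to check using urllib. It works fine until it hits a website that blocks the request. When that happens, I just want to skip that URL and continue with the next one in the list. Any idea how to do this?

Here is the full error: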

Traceback (most recent call last):
  File "C:/Users/Goris/Desktop/ssser/link.py", line 51, in <module>
    x = urllib.request.urlopen(req)
  File "C:\Users\Goris\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\Goris\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Users\Goris\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\Goris\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Users\Goris\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Users\Goris\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Answer 1

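The error you are seeing simply means that the server has marked the requested resource, that is, the URL you tried to access, as forbidden. It gives no indication of why the resource is forbidden, although the most common reason for this error is that you need to log in first.

Either way, the reason does not really matter here. To skip this page and move on to the next one, catch the error that is raised and ignore it. If your URL-access code is inside a loop, like this: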

while <condition>:
    x = urllib.request.urlopen(req)
    <more code>

or

for req in <list>:
    x = urllib.request.urlopen(req)
    <more code>

then probably the easiest way to catch and ignore the error is:

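while <condition>:
    try:
        x = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code in (..., 403, ...):
            continue  # skippable status code: go to the next URL
        raise  # any other HTTP error: x was never assigned, so do not fall through
    <more code>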

where continue jumps immediately to the next iteration of the loop. Or you could move the processing code into a function:

def process_url(x):
    <more code>

while <condition>:
    try:
        x = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code in (..., 403, ...):
            continue
        raise  # x is unbound after a failed urlopen, so process_url(x) cannot run here
    else:
        process_url(x)
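To make the loop pattern above concrete, here is a minimal runnable sketch. The function name `check_urls`, the `opener` parameter (there only so the network call can be swapped out when experimenting), and the particular set of skipped status codes are all assumptions for illustration, not part of the original code.

```python
import urllib.error
import urllib.request

# Status codes treated as "skip this site". This set is an assumption; adjust it.
SKIPPED_CODES = {401, 403}


def check_urls(urls, opener=urllib.request.urlopen, skipped_codes=SKIPPED_CODES):
    """Return a dict mapping each URL that could be opened to its HTTP status."""
    results = {}
    for url in urls:
        try:
            response = opener(url)
        except urllib.error.HTTPError as e:
            if e.code in skipped_codes:
                print(f"Skipping {url}: HTTP {e.code}")
                continue  # move on to the next URL in the list
            raise  # any other HTTP error is unexpected, so let it surface
        # Only reached when urlopen succeeded; close the response when done.
        with response:
            results[url] = response.status
    return results
```

Calling `check_urls(my_urls)` with a plain list of URL strings would then return only the sites that answered, while blocked ones are reported and skipped.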

On the other hand, if your URL-access code is already inside a function, you can just return:

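def access_url(req):
    try:
        x = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code in (..., 403, ...):
            return  # skippable status code: give up on this URL
        raise  # any other HTTP error: x was never assigned, so do not fall through
    <more code>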
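A runnable version of the function-based approach might look like the sketch below. The name `fetch_or_skip`, the `opener` parameter, and the choice to return `None` for a skipped site are assumptions made for this example.

```python
import urllib.error
import urllib.request


def fetch_or_skip(url, opener=urllib.request.urlopen, skipped_codes=(403,)):
    """Return the response body as bytes, or None if the site refused the request."""
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        if e.code in skipped_codes:
            return None  # the caller can treat None as "skipped"
        raise  # other HTTP errors are not silently ignored
```

The caller then filters out the skipped sites, for example with `[body for body in map(fetch_or_skip, urls) if body is not None]`.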

I strongly recommend that you read up on the HTTP status codes and be aware of the errors that urllib.request can generate.
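As a small illustration of those errors: `urllib.error.HTTPError` is a subclass of `urllib.error.URLError`, so the more specific class has to be checked first. The helper name `describe_failure` and the wording of its labels are hypothetical.

```python
import urllib.error


def describe_failure(exc):
    """Return a short label for an exception raised while opening a URL."""
    # HTTPError must be tested first because it is a subclass of URLError.
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered, but with an error status such as 403 or 404.
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        # No usable HTTP response at all (DNS failure, refused connection, ...).
        return f"connection problem: {exc.reason}"
    return f"unexpected: {exc!r}"
```

In an `except` chain the same ordering applies: an `except urllib.error.URLError` clause placed first would also swallow every `HTTPError`.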

Answer 2

I haven't tried this and I don't know urllib, but you can use try and except statements to catch the error and carry on afterwards. You could try:

try:
    pass  # connect to the site here
except Exception:
    pass  # move on to the next site here

If you want to catch the exception repeatedly, that is, keep retrying until the call succeeds, you can use:

def func():
    try:
        pass  # connect to the site here
    except Exception:
        func()  # retry by calling the function again

However, this is not recommended, because you risk blowing the stack (as Matteo Italia pointed out).
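One way to retry without recursion is a bounded loop. The following is only a sketch under assumptions: the function name `open_with_retries`, the default of three attempts, the one-second delay, and the `opener` parameter are all made up for this example.

```python
import time
import urllib.error
import urllib.request


def open_with_retries(url, attempts=3, delay=1.0, opener=urllib.request.urlopen):
    """Try to open url up to `attempts` times; return None if every try fails."""
    for attempt in range(1, attempts + 1):
        try:
            return opener(url)
        except urllib.error.URLError as e:  # also covers HTTPError
            print(f"Attempt {attempt} for {url} failed: {e}")
            if attempt < attempts:
                time.sleep(delay)  # brief pause before the next try
    return None  # gave up; the caller can skip this URL
```

Because the number of attempts is fixed, the call stack never grows, and a site that always answers 403 is abandoned after a few tries instead of looping forever.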

How do I skip websites that return an HTTP 403 error code in Python 3?
