Question

I am doing web scraping in Python with BeautifulSoup.

I am getting the error "HTTP Error 500: Internal Server Error".

Below is my code:

import requests
from bs4 import BeautifulSoup
import pdb
from urllib.request import urlopen
import csv
from urllib.error import HTTPError

for IPRD_ID in range(1,10):
   url = 'https://ipr.etsi.org/IPRDetails.aspx?IPRD_ID={}&IPRD_TYPE_ID=2&MODE=2'.format(IPRD_ID)
   page = urlopen(url)
   soup = BeautifulSoup(page, "lxml")
   table = soup.findAll('table', style="width:100%")
   try:
      for tr in table:
          a = (tr.get_text())
   except:
      print('exe')

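As you can see, I am using the range function, range(1, 10). I stepped through the code: for IPRD_ID=3 the server has no data for the page, so the request fails.

Because there is no data, the following error is raised:

HTTP Error 500: Internal Server Error

Here it is a single 500 Internal Server Error, but with a larger range, say 1 to 100, there may be more error pages. So I would like help on how to skip such pages (for example IPRD_ID=3) and move on.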

Answer 1

In your case, urlopen(url) raised a urllib.error.HTTPError exception. You can catch this exception directly, or catch a more general one such as Exception. You can also add a delay between HTTP requests (strongly recommended in your case), as in my code.

import time
import requests
from bs4 import BeautifulSoup
import pdb
import urllib
from urllib.request import urlopen
import csv
from urllib.error import HTTPError

for IPRD_ID in range(1,10):
    url = 'https://ipr.etsi.org/IPRDetails.aspx?IPRD_ID={}&IPRD_TYPE_ID=2&MODE=2'.format(IPRD_ID)
    try:
        page = urlopen(url)
    except urllib.error.HTTPError as exc:
        print('Something went wrong.')
        time.sleep(10) # wait 10 seconds, then move on to the next IPRD_ID
        continue
    else:
        print('if client get http response, start parsing.')
        soup = BeautifulSoup(page, "lxml")
        table = soup.findAll('table', style="width:100%")
        try:
            for tr in table:
                a = tr.get_text()
        except Exception as exc:
            print('Something went wrong during parsing !!!')
        finally:
            time.sleep(5) # wait 5 seconds if success, and then make HTTP request.

Hope this helps.
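The same idea can be packaged as a small helper that returns None when the server answers with an HTTP error, so the loop just skips that ID. This is only a minimal sketch: the names `fetch_or_none`, `fetch_range`, `URL_TEMPLATE`, and the `opener` and `delay` parameters are made up for illustration, and `opener` exists only so the code can be tried without a network connection.

```python
import time
from urllib.error import HTTPError
from urllib.request import urlopen

URL_TEMPLATE = 'https://ipr.etsi.org/IPRDetails.aspx?IPRD_ID={}&IPRD_TYPE_ID=2&MODE=2'


def fetch_or_none(url, opener=urlopen):
    """Return the response body as bytes, or None if the server sends an HTTP error."""
    try:
        with opener(url) as page:
            return page.read()
    except HTTPError as exc:
        # exc.code holds the status code, e.g. 500 for Internal Server Error
        print('Skipping {}: HTTP {}'.format(url, exc.code))
        return None


def fetch_range(ids, opener=urlopen, delay=0.0):
    """Map each ID that loaded successfully to its raw HTML; failed IDs are left out."""
    pages = {}
    for iprd_id in ids:
        html = fetch_or_none(URL_TEMPLATE.format(iprd_id), opener)
        if html is not None:
            pages[iprd_id] = html
        time.sleep(delay)  # pause between requests; use a few seconds against a real server
    return pages
```

Each value in the returned dictionary can then be passed to BeautifulSoup(html, "lxml") and parsed as in the code above; calling `fetch_range(range(1, 10), delay=5)` would roughly match the original loop.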

Answer 2

Try catching the error code, and continue if an error occurs:

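from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup

for IPRD_ID in range(1,10):
    url = 'https://ipr.etsi.org/IPRDetails.aspx?IPRD_ID={}&IPRD_TYPE_ID=2&MODE=2'.format(IPRD_ID)
    try:
        page = urlopen(url)
        soup = BeautifulSoup(page, "lxml")
        table = soup.findAll('table', style="width:100%")
        for tr in table:
            a = (tr.get_text())

    except HTTPError as err:  # Python 3 syntax; "except HTTPError, err" only works in Python 2
        # after reporting the error, the loop moves on to the next IPRD_ID
        if err.code == 500:
            print("Internal server error 500")
        else:
            print("Some other error. Error code: ", err.code)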
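With a larger range such as 1 to 100, it may also be useful to record which IDs failed and why, instead of only printing a message. The following is a rough sketch under that assumption; `classify_failures`, `make_url`, and `opener` are hypothetical names, not part of urllib. HTTPError is caught before URLError because it is a subclass of URLError.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def classify_failures(ids, make_url, opener=urlopen):
    """Try every ID once and return (ok_ids, failures), where failures maps ID -> reason."""
    ok_ids, failures = [], {}
    for item_id in ids:
        try:
            with opener(make_url(item_id)) as page:
                page.read()
            ok_ids.append(item_id)
        except HTTPError as err:
            # the server answered, but with an error status such as 500
            failures[item_id] = err.code
        except URLError as err:
            # no HTTP response at all (DNS failure, refused connection, ...)
            failures[item_id] = str(err.reason)
    return ok_ids, failures
```

The `failures` dictionary then shows at a glance which pages returned 500 and which failed for another reason, so they can be retried or ignored later.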

How to skip 500 Internal Server Error and continue web scraping with BeautifulSoup?

2 answers: