Question

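I am trying to find broken links on a page. The code works, but every page returns a 302 status code. At first I thought that was fine, but then I found by hand a page that returns a 404 error. I then started reading about the 302 code. I think I somewhat understand it, but is there still a way to get the status code returned after the redirect? If it helps, here is my code: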

import requests
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options,
                          executable_path='C:\\Chromedriver\\chromedriver.exe')
driver.get('https://pageURL.com')
links = driver.find_elements_by_css_selector("a")
for link in links:
    href = link.get_attribute('href')
    # Only check links that have an href and start with the site prefix
    if href is not None and href.startswith('https://URLstart'):
        r = requests.head(href)
        print(href, r.status_code)

Answer 1

requests.head() does not follow redirects by default. To make it follow them, pass allow_redirects=True. (The other HTTP methods follow redirects by default.)

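When redirects are followed, the response's status_code is always that of the last response in the chain. If there were redirects and you need the intermediate statuses, use the response's history attribute (r.history). Example: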

>>> import requests
>>> r = requests.head('http://google.com')  # default behaviour for HEAD
>>> r.status_code
301
>>>
>>> r = requests.head('http://google.com', allow_redirects=True)
>>> r.status_code
200
>>> r.url
'http://www.google.com/'
>>> r.history
[<Response [301]>]
>>> r.history[0].status_code
301
>>> r.history[0].url
'http://google.com/'

For an example of how to iterate over the history, see this answer.
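As a rough sketch of how this could be applied to the link checker in the question, the helpers below walk r.history and report every hop together with the final status. The names redirect_chain, check_link, and is_broken are hypothetical, as are the example.com URLs and the 10-second timeout. The stand-in objects at the end only mimic the status_code, url, and history attributes of a real response so the logic can be tried without network access. Some servers answer HEAD differently from GET, so treat the result as a hint rather than a verdict.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return [(status_code, url), ...] for every hop, final response last."""
    hops = list(response.history) + [response]
    return [(hop.status_code, hop.url) for hop in hops]


def check_link(url, timeout=10):
    """Send a HEAD request that follows redirects and return the chain."""
    # Third-party dependency; imported here so the helpers above and below
    # can be used without it.
    import requests

    response = requests.head(url, allow_redirects=True, timeout=timeout)
    return redirect_chain(response)


def is_broken(chain):
    """Assumption: a link counts as broken when the final status is 4xx/5xx."""
    final_status, _final_url = chain[-1]
    return final_status >= 400


# Offline illustration with stand-in objects shaped like a requests response.
first_hop = SimpleNamespace(
    status_code=302, url="https://example.com/old", history=[]
)
final = SimpleNamespace(
    status_code=404, url="https://example.com/missing", history=[first_hop]
)
chain = redirect_chain(final)
for status, url in chain:
    print(status, url)
print("broken:", is_broken(chain))
```

In the question's loop, calling check_link(href) in place of requests.head(href) would then show both the 302 and whatever status the redirect target finally returns.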
