I'm using Python Selenium to extract all the links from a page. I chose Selenium because many of the sites I'm targeting use JS to render the DOM. However, Selenium throws a stale-element exception while iterating over the page's <a> elements.
from selenium.webdriver import Firefox

def request(url):
    urls = []
    browser = Firefox()
    browser.implicitly_wait(7)
    browser.get(url)
    elements = browser.find_elements_by_xpath("//a")
    for link in elements:
        href = link.get_attribute('href')
        urls.append(href)
    browser.close()
    return urls
Here is the error Selenium throws:
  File "/var/extractor/main.py", line 80, in request
    href = link.get_attribute('href')
  File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webelement.py", line 113, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webelement.py", line 469, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py", line 201, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: Element not found in the cache - perhaps the page has changed since it was looked up
So, to try to work around this, I wrapped the get_attribute('href') call in a try/except:
from selenium.common.exceptions import StaleElementReferenceException

for link in elements:
    try:
        href = link.get_attribute('href')
    except StaleElementReferenceException:
        continue
    urls.append(href)
But this doesn't work either! The application never exits the for loop; it just hangs forever. So I'm not really sure what to do here. I've read plenty of posts about the stale element exception, but none where it hangs forever. I feel like the whole StaleElementReferenceException is a design flaw in Selenium, and it should be able to fail more gracefully. Any advice on how to work around this would be greatly appreciated.
[Edit]: Until I get this fixed, I've been sending page_source to lxml to parse the links instead of Selenium. That does solve the problem, but I'd rather not introduce another dependency and would prefer to fix the issue in Selenium.
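For reference, my lxml fallback looks roughly like this. A minimal sketch: extract_links is just a name I made up for the helper, and resolving relative hrefs with urljoin is an extra I added so the results match what get_attribute('href') returns (absolute URLs):

```python
from urllib.parse import urljoin

from lxml import html  # third-party: pip install lxml

def extract_links(page_source, base_url):
    # lxml parses a static snapshot of the DOM, so nothing can go stale
    tree = html.fromstring(page_source)
    # make relative hrefs absolute, like Selenium's get_attribute('href')
    return [urljoin(base_url, href) for href in tree.xpath('//a/@href')]

# Inside request(), after browser.get(url):
#     urls = extract_links(browser.page_source, url)
#     browser.close()
```

The key point is that page_source is grabbed once as a plain string, so the links are parsed offline and no live WebElement references are held while the page keeps mutating.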