Trying to scrape data from a page that doesn't always load (Python Selenium)

Time: 2018-04-04 14:10:57

Tags: python selenium exception web-scraping

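I am trying to get the HTML of a page that only loads about 33% of the time. My strategy is to keep refreshing the page until it finally loads.

I call this function from another function, where I have already started my driver (edited to wrap the while statement in a try/except block, as @jouokedleaf suggested):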

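from time import sleep

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def get_table(url, driver):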
    driver.get(url)
    main_window = driver.current_window_handle
    html_button = driver.find_element(By.XPATH, '//*[@title="View as HTML"]')
    html_button.send_keys(Keys.CONTROL + Keys.RETURN)
    driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.TAB)
    driver.switch_to.active_element
    try:
        while 'extranet.chem' not in driver.title:
            sleep(2)
            print('refreshing to get data')
            try:
                html_button.send_keys(Keys.CONTROL + Keys.RETURN)
            except Exception:
                print('deeper exception')
                driver.refresh()
    except:
        print('while exception')
        pass

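I use the nested except blocks to catch possible exceptions from the driver.refresh() call. For some reason, even though I call pass to ignore the exception, the loop still breaks when looking up the driver title:

Error message: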

refreshing to get data
refreshing to get data
refreshing to get data
deeper exception
while exception
Traceback (most recent call last):
  File "scraper.py", line 83, in <module>
    get_latest()
  File "scraper.py", line 28, in get_latest
    url = row.find_element(By.XPATH, link_xpath).get_attribute('href')
  File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\site-packages\selenium\webdriver\remote\webelement.py", line 645, in find_element
    {"using": by, "value": value})['value']
  File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\site-packages\selenium\webdriver\remote\webelement.py", line 628, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 312, in execute
    self.error_handler.check_response(response)
  File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 237, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: The element reference of <tr class="ms-alternating"> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed

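Why isn't this exception simply ignored?

1 answer:

Answer 0 (score: 1)

Looking at the traceback you provided, you can see that the exception is raised on the line while 'extranet.chem' not in driver.title: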

File "scraper.py", line 55, in get_table
    while 'extranet.chem' not in driver.title:

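which is not inside a try/except block. I'm not sure I have seen that exact exception when reading driver.title, but I assume it is normal. Without knowing anything about the page you are working with, we can't offer much more help. Your option is to catch the exception raised on that line. If an alert box is present, you most likely won't be able to navigate away from or refresh the page until the alert has been handled. You should build in a way to handle alerts.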
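As a rough illustration of both suggestions, here is a minimal sketch, not a tested fix for your page. The names wait_for_title and dismiss_alert are hypothetical helpers, not Selenium APIs, and the attempts and delay values are arbitrary assumptions. The only driver members it relies on are driver.title, driver.refresh() and driver.switch_to.alert.accept(). The exception types to swallow are passed in as a tuple; with Selenium you would probably pass (WebDriverException,) from selenium.common.exceptions rather than the broad default.

```python
import time


def dismiss_alert(driver, ignore=(Exception,)):
    """Accept a pending alert if there is one; return True when accepted.

    Hypothetical helper. With Selenium, looking up or accepting a missing
    alert raises an exception, which is swallowed here.
    """
    try:
        driver.switch_to.alert.accept()
        return True
    except ignore:
        return False


def wait_for_title(driver, fragment, retry_on=(Exception,), attempts=10, delay=2.0):
    """Refresh until `fragment` appears in driver.title, or give up.

    Hypothetical helper. Returns True on success and False once
    `attempts` tries have been used up.
    """
    for _ in range(attempts):
        try:
            # The title lookup is inside the try, so an error raised while
            # evaluating it is caught here instead of escaping the loop.
            if fragment in driver.title:
                return True
            driver.refresh()
        except retry_on as exc:
            print('retrying after', type(exc).__name__)
            # An open alert can block both the title lookup and the
            # refresh, so try to clear it before the next attempt.
            dismiss_alert(driver, retry_on)
        time.sleep(delay)
    return False
```

In get_table, the whole while loop could then shrink to something like wait_for_title(driver, 'extranet.chem', retry_on=(WebDriverException,)). Capping the number of attempts is a deliberate choice so that a page which never loads does not spin forever.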