Question

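I am writing a generic web scraper using Selenium 2 (version 2.33 Python bindings, Firefox driver). It is supposed to take an arbitrary URL, load the page, and report all of the outbound links. Because the URL is arbitrary, I cannot make any assumptions whatsoever about the contents of the page, so the usual advice (wait for a specific element to be present) does not apply.

I have code that is supposed to poll document.readyState until it reaches "complete" or a 30-second timeout has elapsed, and then proceed: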

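from selenium.common.exceptions import WebDriverException
from selenium.webdriver.support.ui import WebDriverWait

def readystate_complete(d):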
    # AFAICT Selenium offers no better way to wait for the document to be loaded,
    # if one is in ignorance of its contents.
    return d.execute_script("return document.readyState") == "complete"

def load_page(driver, url):
    try:
        driver.get(url)
        WebDriverWait(driver, 30).until(readystate_complete)
    except WebDriverException:
        pass

    links = []
    try:
        for elt in driver.find_elements_by_xpath("//a[@href]"):
            try: links.append(elt.get_attribute("href"))
            except WebDriverException: pass
    except WebDriverException: pass
    return links

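This sort of works, but on about one page in five the .until call hangs forever. When this happens, the browser usually has not in fact finished loading the page (the "throbber" is still spinning), yet tens of minutes can go by without the timeout triggering. Sometimes, though, the page does appear to have loaded completely and the script still does not continue.

What gives? How do I make the timeout work reliably? Is there a better way to request a wait-for-page-to-load (if one cannot make any assumptions about the contents)?

Note: The obsessive catching and ignoring of WebDriverException has proven necessary to ensure that the code extracts as many links from the page as possible, whether or not JavaScript inside the page is doing funny things with the DOM (for example, I used to get "stale element" errors in the loop that extracts the HREF attributes).

Note: There are many variations on this question on this site and elsewhere, but either they all have a subtle but critical difference that makes the answers (if any) useless to me, or I have tried the suggestions and they do not work. Please answer exactly the question I have asked.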

Answer 1

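I had a similar situation: I wrote a screenshot system using Selenium for a fairly well-known website service and had the same predicament, namely that I could not know anything about the page being loaded.

After speaking with some of the Selenium developers, the answer was that the various WebDriver implementations (Firefox Driver versus IEDriver, for example) make different choices about when a page is considered loaded so that WebDriver can return control.

If you dig deep into the Selenium code, you can find the spots that try to make the best choice. But a number of things can cause the state being looked for to fail, such as multiple frames where one does not complete in a timely manner, so there are cases where the driver simply does not return.

I was told that "it's an open-source project", that it probably won't or can't be corrected for every possible scenario, and that I could make fixes and submit patches where applicable.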

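In the long run, that was a bit too much for me to take on, so, like you, I created my own timeout process. Since I use Java, I created a new Thread that, on reaching the timeout, tries several things to get WebDriver to return; even just pressing certain keys to make the browser respond has worked at times. If it still does not return, I kill the browser and try again.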

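Starting the driver again has handled most cases for us, as if the second launch of the browser allowed it to reach a more settled state (bear in mind that we launch from VMs, and the browser constantly wants to check for updates and run certain routines when it has not been launched recently).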
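
A minimal Python sketch of that watchdog-and-retry idea follows. The answer's own implementation was in Java and is not shown, so every name here (`call_with_watchdog`, `load_with_retries`, `make_driver`, `kill_driver`) is hypothetical, and how the browser actually gets killed is left to the caller:

```python
import threading


def call_with_watchdog(task, timeout):
    """Run task() in a worker thread and wait at most `timeout` seconds.

    Returns ("ok", value), ("error", exception) or ("timeout", None).
    """
    result = {}

    def worker():
        try:
            result["value"] = task()
        except Exception as exc:  # report the failure instead of losing it
            result["error"] = exc

    # daemon=True so a worker that never returns cannot keep the process alive
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return ("timeout", None)
    if "error" in result:
        return ("error", result["error"])
    return ("ok", result["value"])


def load_with_retries(make_driver, kill_driver, url, timeout=30, attempts=2):
    """Try to load `url`; on failure, discard the browser and start a new one.

    make_driver() and kill_driver(driver) are caller-supplied (assumed) hooks.
    Returns the driver that loaded the page, or None if every attempt failed.
    """
    for _ in range(attempts):
        driver = make_driver()
        status, _ = call_with_watchdog(lambda: driver.get(url), timeout)
        if status == "ok":
            return driver
        kill_driver(driver)  # hung or failed: throw this browser away
    return None
```

The worker thread is abandoned rather than interrupted, because Python cannot forcibly stop a thread; killing the browser process in `kill_driver` is what would normally make the blocked call fail.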

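Another piece of this is that we launch a known URL first and confirm some aspects of the browser, and that we can in fact interact with it, before continuing. With these steps together the failure rate is pretty low: about 3% over thousands of tests on all browsers, versions and operating systems (FF, IE, Chrome, Safari, Opera, iOS, Android, etc.).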
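
A rough sketch of such a pre-flight check might look like the following. The answer does not say which aspects were confirmed, so the title check and the trivial script are assumptions, as are the name `browser_is_responsive` and its parameters; `driver` is any object exposing the usual WebDriver `get`, `title` and `execute_script` members:

```python
def browser_is_responsive(driver, known_url, expected_title):
    """Return True if the browser loads known_url and answers a trivial script."""
    try:
        driver.get(known_url)
        title_ok = expected_title in driver.title
        script_ok = driver.execute_script("return 1 + 1;") == 2
        return title_ok and script_ok
    except Exception:
        # Any failure here means the browser is not in a usable state.
        return False
```

A caller would run this once after launching the browser and restart the browser if it returns False.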

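Last but not least, for your case it sounds like you only really need to capture the links on the page, not full browser automation. There are other approaches I might take for that, namely cURL and Linux tools.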
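
As an illustration of the browser-free route, here is a small sketch that uses Python's standard library instead of cURL (a substitution, not what the answer used). It extracts `href` values from already-fetched HTML; `LinkCollector` and `extract_links` are hypothetical names. Unlike a real browser, it does not execute JavaScript, so links added by scripts will be missed:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return all <a href> targets found in html_text as absolute URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links
```

The HTML itself could be fetched with `urllib.request.urlopen(url, timeout=30)`, where the timeout bounds each blocking network operation.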

Answer 2

The "recommended" (but still ugly) solution could be to use an explicit wait:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions

old_value = browser.find_element_by_id('thing-on-old-page').text
browser.find_element_by_link_text('my link').click()
WebDriverWait(browser, 3).until(
    expected_conditions.text_to_be_present_in_element(
        (By.ID, 'thing-on-new-page'),
        'expected new text'
    )
)

A naive attempt would be something like this:

import time

def wait_for(condition_function):
    start_time = time.time()
    while time.time() < start_time + 3:
        if condition_function():
            return True
        else:
            time.sleep(0.1)
    raise Exception(
        'Timeout waiting for {}'.format(condition_function.__name__)
    )


def click_through_to_new_page(link_text):
    browser.find_element_by_link_text(link_text).click()

    def page_has_loaded():
        page_state = browser.execute_script(
            'return document.readyState;'
        ) 
        return page_state == 'complete'

    wait_for(page_has_loaded)

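Another, better one would be (credit to @ThomasMarks):

from selenium.common.exceptions import StaleElementReferenceException

def click_through_to_new_page(link_text):
    link = browser.find_element_by_link_text(link_text)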
    link.click()

    def link_has_gone_stale():
        try:
            # poll the link with an arbitrary call
            link.find_elements_by_id('doesnt-matter') 
            return False
        except StaleElementReferenceException:
            return True

    wait_for(link_has_gone_stale)

The last example compares page IDs, as below (which could be bulletproof):

class wait_for_page_load(object):

    def __init__(self, browser):
        self.browser = browser

    def __enter__(self):
        self.old_page = self.browser.find_element_by_tag_name('html')

    def page_has_loaded(self):
        new_page = self.browser.find_element_by_tag_name('html')
        return new_page.id != self.old_page.id

    def __exit__(self, *_):
        wait_for(self.page_has_loaded)

So now we can do:

with wait_for_page_load(browser):
    browser.find_element_by_link_text('my link').click()

The code samples above are from Harry's blog.

Answer 3

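As far as I know, your readystate_complete is not doing anything, because driver.get() already checks for that condition. In any case, I have seen it fail in many situations. One thing you could try is to route your traffic through a proxy and use the proxy to watch for any network traffic. For example, browsermob has a wait_for_traffic_to_stop method: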

def wait_for_traffic_to_stop(self, quiet_period, timeout):
    """
    Waits for the network to be quiet
    :Args:
    - quiet_period - number of milliseconds the network needs to be quiet for
    - timeout - max number of milliseconds to wait
    """
    # Both values are forwarded unchanged to the proxy's *InMs parameters.
    r = requests.put('%s/proxy/%s/wait' % (self.host, self.port),
        {'quietPeriodInMs': quiet_period, 'timeoutInMs': timeout})
    return r.status_code
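
To make the quiet-period idea concrete without a proxy, here is a small self-contained sketch of the same waiting logic. It is not part of browsermob: `wait_until_quiet` and the `last_activity` callback (assumed to return the `time.monotonic()` timestamp of the most recent network event) are hypothetical, and the durations here are in seconds:

```python
import time


def wait_until_quiet(last_activity, quiet_period, timeout, poll=0.05):
    """Wait until no activity has been seen for `quiet_period` seconds.

    last_activity() must return the time.monotonic() timestamp of the most
    recent event. Returns True once the quiet period is reached, or False
    if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if time.monotonic() - last_activity() >= quiet_period:
            return True
        time.sleep(poll)
    return False
```

The separate `timeout` matters for pages that never go quiet (for example, ones that poll the server continuously): the caller gets False after a bounded wait instead of hanging.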

Answer 4

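If the page is still loading indefinitely, I'm guessing that readyState never reaches "complete". If you are using Firefox, you can force the page load to halt by calling window.stop():

from selenium.common.exceptions import TimeoutException

try:
    driver.get(url)
    WebDriverWait(driver, 30).until(readystate_complete)
except TimeoutException:
    # readyState did not reach "complete" within 30 s: abort the page load
    driver.execute_script("window.stop();")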

Answer 5

Here is the solution proposed by Tommy Beadle (using the staleness approach):

import contextlib
from selenium.webdriver import Remote
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

class MyRemote(Remote):
    @contextlib.contextmanager
    def wait_for_page_load(self, timeout=30):
        old_page = self.find_element_by_tag_name('html')
        yield
        WebDriverWait(self, timeout).until(staleness_of(old_page))

Reliably detect page load or time-out, Selenium 2

5 answers: