How do I get the innerHTML of the whole page in a Selenium driver?

Time: 2016-03-10 00:42:10

Tags: selenium

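I am using Selenium to click through to the web page I want, and then parsing that page with Beautiful Soup.

Someone has shown how to get inner HTML of an element in a Selenium WebDriver. Is there a way to get the HTML of the whole page? Thanks.

Sample code in Python (judging by the post above, the language doesn't seem to matter much):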

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup


url = 'http://www.google.com'
driver = webdriver.Firefox()
driver.get(url)

the_html = driver---somehow----.get_attribute('innerHTML')
bs = BeautifulSoup(the_html, 'html.parser')

3 answers:

Answer 0 (score: 24)

To get the HTML of the whole page:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://stackoverflow.com")

html = driver.page_source
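
To tie this back to the question, the string from page_source can be handed straight to Beautiful Soup. A minimal sketch, assuming Beautiful Soup 4 is installed; the helper names fetch_page_html and fetch_soup are made up for illustration and are not part of Selenium:

```python
def fetch_page_html(driver, url):
    """Load `url` in the given WebDriver and return the page's HTML as a string."""
    driver.get(url)
    # page_source is a property, not a method, so there are no parentheses.
    return driver.page_source


def fetch_soup(driver, url, parser="html.parser"):
    """Return the loaded page parsed by Beautiful Soup (third-party `bs4` package)."""
    # Imported here so that fetch_page_html stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    return BeautifulSoup(fetch_page_html(driver, url), parser)
```

With a real browser this would be called as fetch_soup(webdriver.Firefox(), "http://stackoverflow.com"). page_source reflects what the driver reports at the moment it is read, so content that JavaScript adds later may be missing if it is read too early.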

To get the outer HTML (tag included):

from selenium.webdriver.common.by import By

# HTML from `<html>`
html = driver.execute_script("return document.documentElement.outerHTML;")

# HTML from `<body>`
html = driver.execute_script("return document.body.outerHTML;")

# HTML from element with some JavaScript
# (find_element_by_css_selector was removed in Selenium 4.3; find_element with By replaces it)
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = driver.execute_script("return arguments[0].outerHTML;", element)

# HTML from element with `get_attribute`
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = element.get_attribute('outerHTML')

To get the inner HTML (tag excluded):

from selenium.webdriver.common.by import By  # needed for find_element(By.CSS_SELECTOR, ...)

# HTML from `<html>`
html = driver.execute_script("return document.documentElement.innerHTML;")

# HTML from `<body>`
html = driver.execute_script("return document.body.innerHTML;")

# HTML from element with some JavaScript
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = driver.execute_script("return arguments[0].innerHTML;", element)

# HTML from element with `get_attribute`
element = driver.find_element(By.CSS_SELECTOR, "#hireme")
html = element.get_attribute('innerHTML')
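
The inner and outer variants differ only in which DOM property is read, so both can be wrapped in one helper. This is only a sketch: element_html is a hypothetical name, and it relies on nothing beyond execute_script as used in the snippets above.

```python
def element_html(driver, element=None, outer=True):
    """Return the HTML of `element`, or of the whole document if element is None.

    outer=True  -> outerHTML (includes the element's own tag)
    outer=False -> innerHTML (the element's children only)
    """
    prop = "outerHTML" if outer else "innerHTML"
    if element is None:
        # No element given: fall back to the root <html> element.
        return driver.execute_script("return document.documentElement.%s;" % prop)
    # arguments[0] refers to the WebElement passed as the extra argument.
    return driver.execute_script("return arguments[0].%s;" % prop, element)
```

For example, element_html(driver) would return the full page including the html tag, while element_html(driver, element, outer=False) would return only the markup inside the given element.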

Answer 1 (score: 0)

Using a page object:

@FindBy(xpath = "xpath")  // placeholder: replace with the element's real XPath
private WebElement element;

// waitUntilElementToBeClickable is assumed to be a helper defined elsewhere in the page object
public String getInnerHtml() {
    String innerHtml = waitUntilElementToBeClickable(element, 10).getAttribute("innerHTML");
    System.out.println(innerHtml);
    return innerHtml;
}
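
For comparison with the Python snippets in the other answer, roughly the same page object could be written with Selenium's Python bindings. This is a sketch under assumptions: the class name HirePage, the default XPath, and the 10-second timeout are placeholders, and WebDriverWait with element_to_be_clickable stands in for the waitUntilElementToBeClickable helper, whose actual implementation is not shown here.

```python
class HirePage:
    """Hypothetical page object exposing the inner HTML of one element."""

    def __init__(self, driver, xpath="//*[@id='hireme']", timeout=10):
        self.driver = driver
        self.xpath = xpath      # placeholder locator; replace with the real XPath
        self.timeout = timeout  # seconds to wait for the element

    def get_inner_html(self):
        # Imported lazily so the class can be defined without Selenium installed.
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
        from selenium.webdriver.support.ui import WebDriverWait

        # until() returns the element once it is visible and enabled,
        # or raises TimeoutException after `timeout` seconds.
        element = WebDriverWait(self.driver, self.timeout).until(
            EC.element_to_be_clickable((By.XPATH, self.xpath))
        )
        return element.get_attribute("innerHTML")
```

Storing the waited-for element in a variable avoids waiting twice, once for printing and once for returning.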

Answer 2 (score: 0)

driver.page_source may be outdated. The following worked for me with the JavaScript bindings:

let html = await driver.getPageSource();

Reference: https://seleniumhq.github.io/selenium/docs/api/javascript/module/selenium-webdriver/ie_exports_Driver.html#getPageSource