Scraping dynamic content through Selenium?

Time: 2016-04-21 19:57:31

Tags: javascript html python-2.7 selenium web-scraping

I am trying to scrape dynamic content from a blog through Selenium, but it always returns the unrendered JavaScript.

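To test this behaviour, I tried waiting until the iframe had fully loaded and then printing its content. That printed fine, but when I switch back to the parent frame, it again shows only the unrendered JavaScript.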

I am looking for a way to print the fully rendered HTML content.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions

driver = webdriver.Chrome("path to chrome driver")   
driver.get('http://justgivemechocolateandnobodygetshurt.blogspot.com/')

WebDriverWait(driver, 40).until(expected_conditions.frame_to_be_available_and_switch_to_it((By.ID, "navbar-iframe")))

# Rendered iframe HTML is printed.
content = driver.page_source
print content.encode("utf-8")

# When I switch back to parent frame it again prints non rendered JavaScript.
driver.switch_to.parent_frame()
content = driver.page_source
print content.encode("utf-8")

1 Answer:

Answer 0: (score: 3)

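The thing is that .page_source works only in the current context (there is the notion of a "current top-level browsing context"). This means that if you call it on the default content, you will not get the inner HTML of the child iframe elements. For that, you have to switch into the context of the frame and call .page_source there.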

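In other words, to get the complete HTML of a page, including the page source of its iframes, you have to switch into each iframe context one by one and get the source of each separately.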
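The one-by-one switching could look roughly like the sketch below. It is only an illustration under a few assumptions: collect_page_sources is a hypothetical helper name (not part of Selenium), only the direct child iframes of the top-level document are visited, and the frames are assumed to have finished loading already. The locator string "tag name" is the value of By.TAG_NAME, used here so the helper needs no imports.

```python
def collect_page_sources(driver):
    """Return [top-level source, first iframe source, second iframe source, ...].

    `driver` is an already-created Selenium WebDriver. Hypothetical helper:
    nested iframes (iframes inside iframes) are not followed.
    """
    # Start from the top-level document and record its source first.
    driver.switch_to.default_content()
    sources = [driver.page_source]

    # "tag name" is the string value of By.TAG_NAME.
    frame_count = len(driver.find_elements("tag name", "iframe"))

    for index in range(frame_count):
        # Re-locate the frames on every pass in case the DOM changed meanwhile.
        frames = driver.find_elements("tag name", "iframe")
        if index >= len(frames):
            break
        # Enter the frame, grab its source, then return to the parent document.
        driver.switch_to.frame(frames[index])
        sources.append(driver.page_source)
        driver.switch_to.default_content()

    return sources
```

Calling collect_page_sources(driver) after driver.get(...) would then give a list with the parent document first and one entry per iframe after it. The driver is left in the default content when the function returns.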

Old answer:

I would wait for at least one blog entry to load before getting the page_source:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 40)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".entry-content")))

print(driver.page_source)