Question

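I am writing a tool, and one of its actions is to analyze the source of a web page. I use Selenium for Python with the Firefox driver. When I try to get the page source with the webdriver.page_source command, the source I get differs from the one I get the regular way (right-click in the browser -> View Page Source). I use a hook into the browser that should add text to the page. I can see that text in the regular page source, but I cannot see it through Selenium.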

For example:

Source from the browser:

<html>
  <head></head>
  <body>
    <title>Title</title>
    <h1>Test Page</h1>
    <div>THIS DIV INJECTED TO THE BROWSER</div>
  </body>
</html>

Source from Selenium:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head></head>
  <body>
    <title>Title</title>
    <h1>Test Page</h1>
  </body>
</html>

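I looked at a similar post here, but the answers there were not relevant.

Note that I need the source itself, not the rendered code (which I can get using webdriver.execute_script).

How can I get the regular source?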

Answer 1

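The most likely problem here is a timing issue: you are reading the page source before the page has fully loaded. The best way to solve it is to add an explicit wait for a specific element to become present/visible: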

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

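# `driver` is an existing webdriver instance that has already loaded the page.
# Wait up to 10 seconds for the element to be present in the DOM
# ("myid" is a placeholder for an element the page is known to contain).
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "myid")))

# Read the page source only after the wait has succeeded.
print(driver.page_source)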
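
If the injected content has no known element ID to wait for, the same idea can be applied to the text itself. The following is a minimal sketch, not part of Selenium: `wait_for_source_text` is a hypothetical helper that polls `driver.page_source` until a given string shows up. It assumes only that `driver` exposes a `page_source` attribute, as Selenium's webdriver does, and the `timeout` and `poll` defaults are arbitrary.

```python
import time


def wait_for_source_text(driver, text, timeout=10.0, poll=0.5):
    """Poll driver.page_source until `text` appears or `timeout` seconds pass.

    Hypothetical helper: returns the page source that contains `text`,
    or raises TimeoutError if it never shows up.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source  # re-read the page source on every poll
        if text in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{text!r} not found in page source after {timeout}s"
            )
        time.sleep(poll)
```

With the example from the question, this could be called as `wait_for_source_text(driver, "THIS DIV INJECTED TO THE BROWSER")`. If the text never appears even with a generous timeout, timing is probably not the cause.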

Different page source between Selenium and the browser itself