Question

I need to download the full content of an HTML web page, including its images, CSS, and JS.

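First option:

Fetch the page with urllib or requests,
extract the page information with Beautiful Soup or lxml,
download all the linked resources, and
rewrite the links in the original page to relative paths.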
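
The parse-and-rewrite part of these steps can be sketched roughly as follows. This is only an illustration, not a complete downloader: it uses the standard library's `html.parser` in place of Beautiful Soup or lxml so that it runs without extra packages, and the names `AssetCollector`, `local_name`, `plan_download`, and the `assets/` folder are made up for the example. It also performs no network access; it only works out which URLs would need to be fetched and where they would be stored.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Tag/attribute pairs that usually point at page assets.
ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}


class AssetCollector(HTMLParser):
    """Collect the raw attribute values of images, scripts and <link> tags."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = ASSET_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.assets.append(value)


def local_name(asset_url):
    """Map an absolute asset URL to a relative path under 'assets/'.

    Assumption: file names are unique; two assets with the same base name
    would collide in this simplified scheme.
    """
    path = urlparse(asset_url).path
    return posixpath.join("assets", posixpath.basename(path) or "index")


def plan_download(html, page_url):
    """Return (rewritten_html, {absolute_url: local_relative_path})."""
    collector = AssetCollector()
    collector.feed(html)
    mapping = {}
    for raw in collector.assets:
        absolute = urljoin(page_url, raw)
        mapping[absolute] = local_name(absolute)
        # Naive textual rewrite of double-quoted attribute values;
        # a real tool would edit the parsed tree instead.
        html = html.replace('"%s"' % raw, '"%s"' % mapping[absolute])
    return html, mapping
```

Each key of the returned mapping could then be fetched with requests or `urllib.request.urlretrieve` and written to the corresponding local path, and the rewritten HTML saved next to the `assets/` folder.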

Disadvantages

It takes multiple steps.
The downloaded page will never be identical to the remote page, possibly because of JS or AJAX content.

Second option

Some authors suggest automating a web browser to download the page, so that the JavaScript and AJAX run before the download.

scraping ajax sites and java script

I would like to use this option.
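
As a rough illustration of what this option amounts to, the sketch below loads a URL in an already started browser and writes the rendered markup to a file. It relies only on the `get()` method and `page_source` attribute of a Selenium WebDriver; the function name `save_rendered_page` and its parameters are hypothetical. Note that it captures only the HTML after scripts have run, not the images, CSS, or JS files the page refers to.

```python
from pathlib import Path


def save_rendered_page(driver, url, target):
    """Load `url` in a running browser and save the rendered HTML.

    `driver` is assumed to behave like a Selenium WebDriver, i.e. to
    provide `get(url)` and a `page_source` attribute.
    """
    driver.get(url)
    # page_source reflects the DOM as the browser currently sees it,
    # so content inserted by JavaScript is included.
    html = driver.page_source
    target = Path(target)
    target.write_text(html, encoding="utf-8")
    return target
```

With Selenium installed, this might be called as `save_rendered_page(webdriver.Firefox(), "https://www.google.com", "page.html")`. Content that arrives through slow AJAX requests may still be missing unless the script waits for it before reading `page_source`.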

First attempt

So I copied this selenium code to do two steps:

Open the URL in the Firefox browser
Download the page.

Code

import os
from selenium import webdriver 
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', os.environ["HOME"])
profile.set_preference("browser.helperApps.alwaysAsk.force", False)
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/html,text/webviewhtml,text/x-server-parsed-html,text/plaintext,application/octet-stream')

browser = webdriver.Firefox(profile)

def open_new_tab(url):
    ActionChains(browser).send_keys(Keys.CONTROL, "t").perform()
    browser.get(url)
    return browser.current_window_handle

# call the function
open_new_tab("https://www.google.com")
# Result: the browser is opened at the given URL; no download occurs

Result

Unfortunately, no download takes place; the browser only opens at the given URL (the first step).

Second attempt

I thought of downloading the page in a separate function, so I added this function.

Added function

def save_current_page():      
    ActionChains(browser).send_keys(Keys.CONTROL, "s").perform()

# call the function
open_new_tab("https://www.google.com")
save_current_page()

Result

# Nothing more: the browser opens at the given URL, but no download occurs.

Question: how can I download a web page automatically with selenium?

Download the entire HTML page content using selenium

0 answers: