How to save the HTML source code while navigating to each link

Time: 2017-07-08 21:35:45

Tags: python python-2.7 python-3.x selenium selenium-webdriver

Here is my code:

import os
import urllib

from selenium import webdriver

driver = webdriver.Chrome()
path = "/home/winpc/test/python/dup/new"

def get_link_urls(url, driver):
    driver.get(url)
    url = urllib.urlopen(url)
    content = url.readlines()
    urls = []
    for link in driver.find_elements_by_tag_name('a'):
        # Source of the page that is currently loaded in the browser
        elem = driver.find_element_by_xpath("//*")
        source_code = elem.get_attribute("outerHTML")
        test = link.get_attribute('href')
        if str(test) != 'None':
            # File name is derived from the link target, e.g. "about" -> "aboutCopy.html"
            file_name = test.rsplit('/')[-1].split('.')[0]
            file_name_formated = file_name + "Copy.html"
            with open(os.path.join(path, file_name_formated), 'wb') as temp_file:
                temp_file.write(source_code.encode('utf-8'))
        urls.append(link.get_attribute('href'))
    return urls

urls = get_link_urls("http://localhost:8080",driver)
sub_urls = []
for url in urls:
    if str(url) != 'None':
        sub_urls.extend(get_link_urls(url,driver))

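This code navigates to each link correctly, but every saved file contains only the HTML of the first page. I need to save the source code of each page as I navigate to it. The saving part is done by the code below: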

file_name_formated = file_name + "Copy.html"
with open(os.path.join(path, file_name_formated), 'wb') as temp_file:
    temp_file.write(source_code.encode('utf-8'))

1 Answer:

Answer 0 (score: 0)

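First, you keep overwriting url inside the function (the url parameter is reassigned to the result of urllib.urlopen), so fix that.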

To save the page source through Selenium, you can use driver.page_source once driver.get() has loaded the page.
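As a rough sketch of that idea (not the asker's exact code): collect the hrefs from the start page first, then load each one and write driver.page_source for the page that was just loaded. The helper names file_name_for, save_page_source and save_linked_pages are made up for illustration, and the find_elements("tag name", "a") call assumes the Selenium 4 locator style rather than find_elements_by_tag_name used in the question.

```python
import os
from urllib.parse import urlparse


def file_name_for(url):
    """Build a name such as 'aboutCopy.html' from a URL (hypothetical helper)."""
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    stem = last_segment.split(".")[0] or "index"  # fall back when the path is empty
    return stem + "Copy.html"


def save_page_source(driver, url, out_dir):
    """Load url in the browser, then write that page's own source into out_dir."""
    driver.get(url)
    target = os.path.join(out_dir, file_name_for(url))
    with open(target, "wb") as out_file:
        # page_source is read after get(), so it belongs to the page just loaded
        out_file.write(driver.page_source.encode("utf-8"))
    return target


def save_linked_pages(driver, start_url, out_dir):
    """Save the source of every http(s) page linked from start_url."""
    driver.get(start_url)
    # Read all hrefs before navigating away; the <a> elements go stale afterwards.
    # "tag name" is the locator string behind By.TAG_NAME (Selenium 4 style, assumed).
    hrefs = [link.get_attribute("href") for link in driver.find_elements("tag name", "a")]
    pages = [href for href in hrefs if href and href.startswith("http")]
    return [save_page_source(driver, href, out_dir) for href in pages]
```

With a real browser this would be called as save_linked_pages(webdriver.Chrome(), "http://localhost:8080", path), using the driver and path from the question. Two links whose last path segment is the same would still overwrite each other's file.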

Also, if you want this code to run faster, consider using the requests module.

response = requests.get(url).content
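A slightly fuller sketch of the requests variant, in case it helps. The function names fetch_html and save_html, the session parameter and the 10 second timeout are my own assumptions, not part of the original snippet. Note that requests only downloads the raw HTML and does not run JavaScript, so the result can differ from driver.page_source on script-heavy pages.

```python
def fetch_html(url, session=None, timeout=10):
    """Return the raw response body for url as bytes."""
    if session is None:
        import requests  # third-party: pip install requests
        session = requests  # the module itself exposes get(), like a Session does
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    return response.content


def save_html(url, target_path, session=None):
    """Download url and write the bytes to target_path unchanged."""
    html = fetch_html(url, session=session)
    with open(target_path, "wb") as out_file:
        out_file.write(html)  # already bytes, so no .encode() is needed
    return len(html)
```

Passing a requests.Session() as session reuses the connection across many URLs, which is usually where most of the speed-up over driving a browser comes from.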