Here is my code:
import os
import urllib

from selenium import webdriver

driver = webdriver.Chrome()
path = "/home/winpc/test/python/dup/new"

def get_link_urls(url,driver):
    driver.get(url)
    url = urllib.urlopen(url)
    content = url.readlines()
    urls = []
    for link in driver.find_elements_by_tag_name('a'):
        elem = driver.find_element_by_xpath("//*")
        source_code = elem.get_attribute("outerHTML")
        test = link.get_attribute('href')
        if str(test) != 'None':
            file_name=test.rsplit('/')[-1].split('.')[0]
            file_name_formated = file_name + "Copy.html"
            with open(os.path.join(path, file_name_formated), 'wb') as temp_file:
                temp_file.write(source_code.encode('utf-8'))
        urls.append(link.get_attribute('href'))
    return urls

urls = get_link_urls("http://localhost:8080",driver)
sub_urls = []
for url in urls:
    if str(url) != 'None':
        sub_urls.extend(get_link_urls(url,driver))
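This code navigates to every link correctly, but every saved file contains only the first HTML page. I need to save the source code of each page it navigates to. The saving part uses the code below: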
file_name_formated = file_name + "Copy.html"
with open(os.path.join(path, file_name_formated), 'wb') as temp_file:
    temp_file.write(source_code.encode('utf-8'))
Answer 0 (score: 0)
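First of all, you are overwriting url inside the function: url = urllib.urlopen(url) rebinds the parameter to a file-like object, so fix that. Note also that source_code is read from the page that is currently loaded, inside the loop over its links, so the same page gets written once per link under a different file name.
To save the page source through Selenium, you can use driver.page_source after driver.get(url).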
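A minimal sketch of that idea, assuming Python 3: load one URL, then write whatever driver.page_source returns for it. The helper names page_file_name and save_page and the "index" fallback for URLs without a path are illustrative, not from the question; driver is expected to be the Selenium WebDriver created there.

```python
import os
from urllib.parse import urlparse


def page_file_name(url):
    """Build '<name>Copy.html' from the last path segment of url."""
    last_segment = os.path.basename(urlparse(url).path.rstrip("/"))
    # "index" is an assumed fallback for URLs with no path, e.g. the site root.
    stem = os.path.splitext(last_segment)[0] or "index"
    return stem + "Copy.html"


def save_page(driver, url, out_dir):
    """Load url in the browser and write the page it rendered into out_dir."""
    driver.get(url)
    html = driver.page_source  # source of the page that is loaded right now
    target = os.path.join(out_dir, page_file_name(url))
    with open(target, "wb") as temp_file:
        temp_file.write(html.encode("utf-8"))
    return target
```

Calling save_page(driver, url, path) once per URL returned by get_link_urls, rather than once per link found on the current page, should give one file per visited page. Pages whose last path segment is the same will still overwrite each other.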
Also, if you want this code to be faster, consider using the requests module:
response = requests.get(url).content
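As a rough sketch of how that one-liner could replace the browser for the saving step, assuming Python 3 and the third-party requests package: the function below takes a requests.Session (or any object with a compatible get method) and writes the response body to disk. The name save_with_session, the timeout value, and the "index" fallback are assumptions made for illustration.

```python
import os
from urllib.parse import urlparse


def save_with_session(session, url, out_dir, timeout=10):
    """Download url through session.get and write the raw bytes into out_dir."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    last_segment = os.path.basename(urlparse(url).path.rstrip("/"))
    stem = os.path.splitext(last_segment)[0] or "index"  # assumed fallback name
    target = os.path.join(out_dir, stem + "Copy.html")
    with open(target, "wb") as temp_file:
        temp_file.write(response.content)  # .content is already bytes
    return target


# Usage (needs requests installed and the server from the question running):
#   import requests
#   with requests.Session() as session:
#       save_with_session(session, "http://localhost:8080",
#                         "/home/winpc/test/python/dup/new")
```

Keep in mind that requests does not execute JavaScript, so for script-heavy pages the saved HTML can differ from what driver.page_source returns.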