I'm working on extracting some data from a website. I can successfully navigate to the page that lists all of the data updated the previous day, but now I need to iterate through all of the links and save the source of each page to a file.
Once it's in a file, I want to use BeautifulSoup to arrange the data better so I can parse it.
#learn.py
from BeautifulSoup import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
url1 = 'https://odyssey.tarrantcounty.com/default.aspx'
date = '07/31/2014'
option_by_date = "6"
driver = webdriver.Firefox()
driver.get(url1)
continue_link = driver.find_element_by_partial_link_text('Case')
#follow link
continue_link.click()
driver.find_element_by_xpath("//select[@name='SearchBy']/option[text()='Date Filed']").click()
#fill in dates in form
from_date = driver.find_element_by_id("DateFiledOnAfter")
from_date.send_keys(date)
to_date = driver.find_element_by_id("DateFiledOnBefore")
to_date.send_keys(date)
submit_button = driver.find_element_by_id('SearchSubmit')
submit_button.click()
link_list = driver.find_elements_by_partial_link_text('2014')
link_list should be a list of the applicable links, but I'm not sure where to go from there.
Answer 0 (score: 0)
Get the href attribute of all the links that start with CaseDetail.aspx?CaseID=; find_elements_by_xpath() will help with this:
# get the list of links to the individual case pages
links = [link.get_attribute('href')
         for link in driver.find_elements_by_xpath('//td/a[starts-with(@href, "CaseDetail.aspx?CaseID=")]')]
for link in links:
    # follow the link
    driver.get(link)
    # parse the data
    print(driver.find_element_by_class_name('ssCaseDetailCaseNbr').text)
This prints:
Case No. 2014-PR01986-2
Case No. 2014-PR01988-1
Case No. 2014-PR01989-1
...
Note that you don't need to save the pages and parse them with BeautifulSoup. Selenium itself is quite powerful when it comes to navigating and extracting the data from web pages.
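For example, if you wanted to keep the case numbers rather than just print them, here is a minimal sketch building on the loop above (the case_numbers.txt filename is only an illustrative assumption):
# collect the case numbers, then write them to a plain text file
case_numbers = []
for link in links:
    driver.get(link)
    case_numbers.append(driver.find_element_by_class_name('ssCaseDetailCaseNbr').text)

with open('case_numbers.txt', 'w') as f:
    for number in case_numbers:
        f.write(number + '\n')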
Answer 1 (score: 0)
You can get web elements by their tag name. If you want to get all the links on the page, I would use find_elements_by_tag_name().
links = driver.find_elements_by_tag_name('a')
link_urls = [link.get_attribute('href') for link in links]
source_dict = dict()
for url in link_urls:
    driver.get(url)
    source = driver.page_source  # this will give you the page source
    source_dict[url] = source
# the source_dict dictionary will contain the source code you wanted for each url, with the url as the key
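Since you wanted each page's source saved to a file and then arranged with BeautifulSoup, a minimal sketch of that last step follows (the page_N.html naming scheme is only an illustrative assumption, and the import matches the BeautifulSoup 3 import from your question):
import io
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, same import as in the question

for i, (url, source) in enumerate(source_dict.items()):
    # save the raw source so it can be re-parsed later without re-fetching the page
    with io.open('page_%d.html' % i, 'w', encoding='utf-8') as f:
        f.write(source)
    # parse the saved source with BeautifulSoup
    soup = BeautifulSoup(source)
    print(soup.find('title'))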