I'm working on extracting some data from a website. I can successfully navigate to the page that lists all of the data updated the previous day, but now I need to iterate through all of the links and save the source of each page to a file.
Once it's in a file, I want to use BeautifulSoup to arrange the data better so I can parse it.
#learn.py
from BeautifulSoup import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
url1 = 'https://odyssey.tarrantcounty.com/default.aspx'
date = '07/31/2014'
option_by_date = "6"
driver = webdriver.Firefox()
driver.get(url1)
continue_link = driver.find_element_by_partial_link_text('Case')
#follow link
continue_link.click()
driver.find_element_by_xpath("//select[@name='SearchBy']/option[text()='Date Filed']").click()
#fill in dates in form
from_date = driver.find_element_by_id("DateFiledOnAfter")
from_date.send_keys(date)
to_date = driver.find_element_by_id("DateFiledOnBefore")
to_date.send_keys(date)
submit_button = driver.find_element_by_id('SearchSubmit')
submit_button.click()
link_list = driver.find_elements_by_partial_link_text('2014')
link_list should be a list of the applicable links, but I'm not sure where to go from there.
Answer 0 (score: 0)
Get the href attribute of all the links that start with CaseDetail.aspx?CaseID=; find_elements_by_xpath() will help with this:
# get the list of links to the individual case pages
links = [link.get_attribute('href')
         for link in driver.find_elements_by_xpath('//td/a[starts-with(@href, "CaseDetail.aspx?CaseID=")]')]
for link in links:
    # follow the link
    driver.get(link)
    # parse the data
    print(driver.find_element_by_class_name('ssCaseDetailCaseNbr').text)
This prints:
Case No. 2014-PR01986-2
Case No. 2014-PR01988-1
Case No. 2014-PR01989-1
...
Note that you don't need to save the pages and parse them with BeautifulSoup. Selenium itself is quite powerful when it comes to navigating and extracting the data from web pages.
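For example, if you wanted to keep the case numbers rather than just print them, here is a minimal sketch building on the loop above (the case_numbers.txt filename is only an illustrative assumption):
# collect the case numbers, then write them to a plain text file
case_numbers = []
for link in links:
    driver.get(link)
    case_numbers.append(driver.find_element_by_class_name('ssCaseDetailCaseNbr').text)

with open('case_numbers.txt', 'w') as f:
    for number in case_numbers:
        f.write(number + '\n')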
Answer 1 (score: 0)
You can get web elements by their tag name. If you want to get all the links on the page, I would use find_elements_by_tag_name().
links = driver.find_elements_by_tag_name('a')
link_urls = [link.get_attribute('href') for link in links]
source_dict = dict()
for url in link_urls:
    driver.get(url)
    source = driver.page_source  # this will give you the page source
    source_dict[url] = source
# the source_dict dictionary will contain the source code you wanted for each url, with the url as the key
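Since you wanted each page's source saved to a file and then arranged with BeautifulSoup, a minimal sketch of that last step follows (the page_N.html naming scheme is only an illustrative assumption, and the import matches the BeautifulSoup 3 import from your question):
import io
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, same import as in the question

for i, (url, source) in enumerate(source_dict.items()):
    # save the raw source so it can be re-parsed later without re-fetching the page
    with io.open('page_%d.html' % i, 'w', encoding='utf-8') as f:
        f.write(source)
    # parse the saved source with BeautifulSoup
    soup = BeautifulSoup(source)
    print(soup.find('title'))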