Web scraping JavaScript-rendered content with Selenium in Python

Date: 2019-12-02 18:17:01

Tags: python selenium-webdriver web-scraping webdriverwait window-handles

I am very new to web scraping and have been trying to use Selenium to simulate a browser that visits a Texas public contracts web page and then downloads an embedded PDF. The website is here: http://www.txsmartbuy.com/sp

So far, I have successfully used Selenium to select an option in one of the drop-down menus, "Agency Name", and to click the search button. My Python code is listed below.

import os
os.chdir("/Users/fsouza/Desktop") #Setting up directory

from bs4 import BeautifulSoup #Downloading pertinent Python packages
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

chromedriver = "/Users/fsouza/Desktop/chromedriver" #Setting up Chrome driver
driver = webdriver.Chrome(executable_path=chromedriver)
driver.get("http://www.txsmartbuy.com/sp")
delay = 3 #Seconds

WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, "//select[@id='agency-name-filter']/option[69]")))    
health = driver.find_element_by_xpath("//select[@id='agency-name-filter']/option[68]")
health.click()
search = driver.find_element_by_id("spBtnSearch")
search.click()

Once I get to the results page, though, I am stuck.

First, I cannot access any of the result links through the HTML page source. However, if I manually inspect the individual links in Chrome, I do find the relevant tags (<a href...) associated with the individual results. My guess is that this is because the content is rendered by JavaScript.

Second, even if Selenium were able to see these individual tags, they have no class or ID. The best way to call them, I figured, would be to call the <a> tags in the order they appear (see the code below), but that did not work either. Instead, the click lands on other "visible" tags (in the footer, which I do not need).

Third, assuming these approaches did work, how can I determine the number of <a> tags displayed on the page (so that this code can loop through every result)?

driver.execute_script("document.getElementsByTagName('a')[27].click()")
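As an aside on the index-based click above: rather than hard-coding an index like `[27]`, one option is to harvest all anchors and filter them in plain Python. A minimal sketch of just the filtering step (hypothetical: it assumes result links can be told apart from footer links by a `/sp/` fragment in their href, which would need to be verified against the actual page):

```python
def filter_result_links(hrefs, fragment="/sp/"):
    """Keep only hrefs that look like search-result detail pages.

    `hrefs` is a list of href strings harvested from <a> tags, e.g. via
    [a.get_attribute("href") for a in driver.find_elements_by_tag_name("a")].
    The "/sp/" fragment is an assumption, not confirmed against the site.
    """
    return [h for h in hrefs if h and fragment in h]

links = filter_result_links([
    "/sp/HHS0006862",   # a result link
    "/about-us",        # footer noise
    None,               # anchors without an href
])
print(len(links))  # → 1
```

The length of the filtered list then answers the "how many results" question without relying on tag order.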

Thank you for any attention you can give this, and please forgive any ignorance on my part; I am just getting started.

2 Answers:

Answer 0 (score: 2)

To scrape the JavaScript-rendered content using Selenium, you need to:

  • Induce WebDriverWait for the desired element_to_be_clickable().
  • Induce WebDriverWait for visibility_of_all_elements_located().
  • Open each link in a new tab using Ctrl + click() through ActionChains.
  • Induce WebDriverWait and switch_to the newly opened tab for web scraping.
  • Switch back to the main page.
  • Code block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.keys import Keys
    import time
    
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("http://www.txsmartbuy.com/sp")
    windows_before = driver.current_window_handle
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@id='agency-name-filter' and @name='agency-name']"))).click()
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@id='agency-name-filter' and @name='agency-name']//option[contains(., 'Health & Human Services Commission - 529')]"))).click()
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@id='spBtnSearch']/i[@class='icon-search']"))).click()
    for link in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table/tbody//tr/td/strong/a"))):
        ActionChains(driver).key_down(Keys.CONTROL).click(link).key_up(Keys.CONTROL).perform()
        WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
        windows_after = driver.window_handles
        new_window = [x for x in windows_after if x != windows_before][0]
        driver.switch_to.window(new_window)  # switch_to_window() is deprecated
        time.sleep(3)
        print("Focus on the newly opened tab and here you can scrape the page")
        driver.close()
        driver.switch_to.window(windows_before)
    driver.quit()
    
  • Console output:

    Focus on the newly opened tab and here you can scrape the page
    Focus on the newly opened tab and here you can scrape the page
    Focus on the newly opened tab and here you can scrape the page
    .
    .
    
  • Browser snapshot: (screenshot omitted)
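One more note on the code block in this answer: the window-handle bookkeeping inside the loop (finding which handle is the newly opened tab) can be isolated into a small pure helper, which makes that logic easier to test on its own. A sketch; the surrounding Selenium calls stay unchanged, and the helper name is my own invention:

```python
def pick_new_window(handles, original):
    """Return the one handle in `handles` that is not `original`.

    `handles` plays the role of driver.window_handles; `original` is the
    handle saved before the Ctrl+click. Raises if no new window exists.
    """
    new = [h for h in handles if h != original]
    if not new:
        raise ValueError("no new window handle found")
    return new[0]

# Inside the loop this would replace the list comprehension, e.g.:
# driver.switch_to.window(pick_new_window(driver.window_handles, windows_before))
print(pick_new_window(["main", "popup"], "main"))  # → popup
```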

Answer 1 (score: 1)

To get the <a> tags you pointed out in the results, use the following XPath:

//tbody//tr//td//strong//a

After clicking the search button, you can extract them in a loop. First, you need to find all the elements with .visibility_of_all_elements_located:

search.click()

elements = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, "//tbody//tr//td//strong//a")))

print(len(elements))

for element in elements:
    get_text = element.text 
    print(get_text)
    url_number = element.get_attribute('onclick').replace('window.open("/sp/', '').replace('");return false;', '')
    get_url = 'http://www.txsmartbuy.com/sp/' + url_number
    print(get_url)

One of the results:


IFB HHS0006862, Blankets, San Angelo Canteen Resale. 529-96596. http://www.txsmartbuy.com/sp/HHS0006862
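The chained .replace() calls in the loop above are fragile if the onclick markup varies even slightly. A regex that pulls the path straight out of window.open(...) may be more robust; this is a sketch assuming onclick values shaped like the one this answer relies on (`window.open("/sp/…");return false;`):

```python
import re

def url_from_onclick(onclick, base="http://www.txsmartbuy.com"):
    """Extract the absolute detail-page URL from an onclick handler.

    Expects strings like: window.open("/sp/HHS0006862");return false;
    Returns None when the pattern does not match.
    """
    match = re.search(r'window\.open\("([^"]+)"\)', onclick)
    if match is None:
        return None
    return base + match.group(1)

print(url_from_onclick('window.open("/sp/HHS0006862");return false;'))
# → http://www.txsmartbuy.com/sp/HHS0006862
```

Returning None on a miss (instead of a mangled string, as the .replace() chain would produce) makes unexpected markup easy to detect in the loop.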