I'm fairly sure this is something site-specific, because I've tried my code on other sites (with the XPath adjusted) and it worked. I'm trying to get all of the PDF links on the site listed in the code, but

> driver.find_elements_by_xpath(xpath) returns an empty list []
Code:
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    def scrape_url(url):
        xpath = '//*[@class="panel-body"]//a'
        options = Options()
        options.headless = True
        # change filepath of chromedriver
        driver = webdriver.Chrome(options=options, executable_path=r'C:\Users\User\Desktop\chromedriver')
        try:
            driver.get(url)
            all_href_elements = driver.find_elements_by_xpath(xpath)
            print("all_href_elements", all_href_elements)  # <-- empty list []
            for href_element in all_href_elements:
                article_url_text = href_element.text
                print(article_url_text)
                if article_url_text == "PDF":
                    article_url = href_element.get_attribute('href')
                    print(article_url_text, article_url)
                    if article_url:
                        self.urls.add(article_url)  # note: `self` suggests this was lifted from a class method
            print("num of urls", len(self.urls))
        except Exception as e:
            print(e)
            print(url)

    url = 'https://www.govinfo.gov/committee/senate-armedservices?path=/browsecommittee/chamber/senate/committee/armedservices/collection/BILLS/congress/106'
    scrape_url(url)
Using the Chrome extension XPath Helper, though, the same XPath query does return matches. I suspect the links are generated dynamically and don't exist until the pane is "opened". But wouldn't the pane need to "open" for the webdriver to fetch the page anyway?
How can I fix this?

Thanks
Answer 0 (score: 1)
Just use an explicit wait for the elements:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    all_href_elements = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, xpath))
    )