I'm fairly sure this is something site-specific, because I've tried my code on other sites (with the XPath adjusted) and it worked. I'm trying to get all of the PDF links on the site listed in the code, but

> driver.find_elements_by_xpath(xpath) returns an empty list []
Code:
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    def scrape_url(url):
        xpath = '//*[@class="panel-body"]//a'
        options = Options()
        options.headless = True
        # change filepath of chromedriver
        driver = webdriver.Chrome(options=options, executable_path=r'C:\Users\User\Desktop\chromedriver')
        try:
            driver.get(url)
            all_href_elements = driver.find_elements_by_xpath(xpath)
            print("all_href_elements", all_href_elements)  # <-- empty list []
            for href_element in all_href_elements:
                article_url_text = href_element.text
                print(article_url_text)
                if article_url_text == "PDF":
                    article_url = href_element.get_attribute('href')
                    print(article_url_text, article_url)
                    if article_url:
                        self.urls.add(article_url)  # note: `self` suggests this was lifted from a class method
            print("num of urls", len(self.urls))
        except Exception as e:
            print(e)
            print(url)

    url = 'https://www.govinfo.gov/committee/senate-armedservices?path=/browsecommittee/chamber/senate/committee/armedservices/collection/BILLS/congress/106'
    scrape_url(url)
Using the Chrome extension XPath Helper, though, the same XPath query does return matches. I suspect the links are generated dynamically and don't exist until the pane is "opened". But wouldn't the pane need to "open" for the webdriver to fetch the page anyway?
How can I fix this?

Thanks
Answer 0 (score: 1)
Just use an explicit wait for the elements:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    all_href_elements = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, xpath))
    )