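Extracting a website's breadcrumbs with Selenium

Time: 2019-09-08 13:16:59

Tags: selenium xpath web-scraping css-selectors webdriverwait

I need to extract the breadcrumb from the following website: https://www.woolworths.com.au/Shop/Browse/drinks/cordials-juices-iced-teas/iced-teas

I tried inspecting the element and copying its XPath, but nothing is extracted: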

from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.woolworths.com.au/Shop/Browse/drinks/cordials-juices-iced-teas/iced-teas')
driver.find_elements_by_xpath('//*[@id="center-panel"]/div/wow-tile-list-with-content/ng-transclude/wow-browse-tile-list/wow-tile-list/div/div[1]/div[1]/wow-breadcrumbs/div/ul/li[4]/span/span')

driver.find_element_by_css_selector('#center-panel > div > wow-tile-list-with-content > ng-transclude > wow-browse-tile-list > wow-tile-list > div > div.tileList > div.tileList-headerContainer > wow-breadcrumbs > div > ul > li:nth-child(4) > span > span')

How should I proceed?

3 answers:

Answer 0 (score: 1)

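The page you are scraping is written in Angular, which means most DOM elements are loaded dynamically by JavaScript/AJAX code and do not yet exist when the initial page load completes (that is, when driver.get returns).

You should use an explicit wait (WebDriverWait with until) to locate such elements.

Here is a working example using the XPath you provided: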

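from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('https://www.woolworths.com.au/Shop/Browse/drinks/cordials-juices-iced-teas/iced-teas')
try:
    # Wait up to 10 seconds for the breadcrumb element to be added to the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="center-panel"]/div/wow-tile-list-with-content/ng-transclude/wow-browse-tile-list/wow-tile-list/div/div[1]/div[1]/wow-breadcrumbs/div/ul/li[4]/span/span'))
    )
    print(element.text)  # this outputs Iced Teas
except TimeoutException:
    print("Timeout")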
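
To make the idea of an explicit wait concrete, here is a minimal pure-Python sketch of the polling loop that such a wait performs. It is not Selenium's implementation: `wait_until`, `WaitTimeout` and `find_breadcrumb` are hypothetical names made up for illustration, and the "page" is simulated so the snippet runs without a browser. It mirrors the general behaviour of WebDriverWait: call a condition repeatedly, ignore "not found yet" errors, and give up after a timeout.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except LookupError:
            # Treat "not found yet" as a reason to keep polling.
            pass
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page: the breadcrumb "appears" only on the third lookup,
# standing in for an element that JavaScript renders after the page loads.
attempts = {"count": 0}


def find_breadcrumb():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("breadcrumb not rendered yet")
    return "Iced Teas"


print(wait_until(find_breadcrumb, timeout=2.0, poll_interval=0.01))
```

A single lookup made immediately after the page load corresponds to the first call of `find_breadcrumb`, which fails; the wait succeeds only because it retries until the element exists.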

Answer 1 (score: 1)

To print the breadcrumb of the website https://www.woolworths.com.au/Shop/Browse/drinks/cordials-juices-iced-teas/iced-teas, you must induce WebDriverWait for visibility_of_element_located(), and you can use either of the following Locator Strategies:

  • Using CSS_SELECTOR and the get_attribute() method:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "ul.breadcrumbs-linkList li:nth-child(4) span span"))).get_attribute("innerHTML"))
    
  • Using XPATH and the text attribute:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//ul[@class='breadcrumbs-linkList']//following-sibling::li[4]//span//span"))).text)
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

Answer 2 (score: 0)

The following XPath worked for my verification:

//*[span='first text' and span='Search results for "second text"']