Selenium web scraping: nested divs with no ID or class name

Time: 2020-09-02 04:03:16

Tags: python selenium xpath css-selectors webdriverwait

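I'm trying to use Selenium to get product names and quantities from a nested HTML table. My problem is that some of the divs don't have any ID or class name. The table I'm trying to access is the critical products list. Here is what I've done so far, but I'm lost on how to reach the nested divs. The site URL is in the code.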

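import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()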
options.add_argument('start-maximized')

driver = webdriver.Chrome(chrome_options=options, executable_path=r'/usr/local/bin/chromedriver/')
url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
time.sleep(150)
page = driver.page_source
driver.quit()


html_soup = BeautifulSoup(page, 'html.parser')
item_containers = html_soup.find_all('div', class_='critical-products-title hide-mobile')

if item_containers:
    for item in item_containers:
        # need to loop the inner divs to reach the href and then get to the left and right classes to get title and quantity
        for link in item.findAll('a'):
            print(item)

Here is a screenshot from the inspector. I want to be able to loop through all the divs and get the title and quantity.

[screenshot of the inspected markup]

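4 answers:

Answer 0 (score: 1)

You don't need Beautiful Soup, and you don't need to save page_source. I use a CSS selector to select all the target rows in the table, then apply a list comprehension to pick out the left and right sides of each row. The result is a list of tuples.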

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('start-maximized')

driver = webdriver.Chrome(chrome_options=options, executable_path=r'/usr/local/bin/chromedriver/')
url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
time.sleep(150)

elements = driver.find_elements_by_css_selector('#app > div:nth-child(1) > div.header-wrapper > div.header-right > div.critical-product-table-container > div.table.shorten.hide-mobile > div > div > div > a > div')

targetted_values = [(element.find_element_by_css_selector('.line-item-left').text, element.find_element_by_css_selector('.line-item-right').text) for element in elements]

driver.quit()

Example output of targetted_values:

[('Surgical & Reusable Masks', '376,713,363 available'),
('Disposable Gloves', '66,962,093 available'),
('Gowns and Coveralls', '40,502,145 available'),
('Respirators', '22,189,273 available'),
('Surface Wipes', '20,650,831 available'),
('Face Shields', '16,535,686 available'),
('Hand Sanitizer', '11,152,890 L available'),
('Thermometers', '8,457,993 available'),
('Testing Kits', '2,110,815 available'),
('Surface Solutions', '107,452 L available'),
('Protective Barriers', '10,833 available'),
('Ventilators', '410 available')]
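
If the quantities are needed as numbers rather than display strings, the tuples can be post-processed. This is only a minimal sketch: it assumes every right-hand value follows the pattern seen in the sample output above (a number with thousands separators, an optional unit such as "L", then the word "available"). The helper name parse_quantity is hypothetical and not part of the answer.

```python
import re

# Assumed pattern, inferred only from the sample output:
# digits with commas, an optional unit (e.g. "L"), then "available".
QUANTITY_RE = re.compile(r"^([\d,]+)(?:\s+(\w+))?\s+available$")


def parse_quantity(text):
    """Return (amount, unit) for a string like '11,152,890 L available'.

    unit is None when the string carries no unit.
    """
    match = QUANTITY_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected quantity format: {text!r}")
    amount = int(match.group(1).replace(",", ""))
    return amount, match.group(2)


# Two rows copied from the sample output.
rows = [
    ("Hand Sanitizer", "11,152,890 L available"),
    ("Ventilators", "410 available"),
]
parsed = {name: parse_quantity(quantity) for name, quantity in rows}
print(parsed)
```

Raising on an unexpected format, rather than returning a default, makes it obvious when the site changes how it renders quantities.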

Answer 1 (score: 1)

To print the product names and quantities, induce WebDriverWait for visibility_of_all_elements_located(); you can use either of the following Locator Strategies:

  • Using CSS_SELECTOR and the text attribute:

    driver.get('https://www.rrpcanada.org/#/')
    items =  [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table.shorten.hide-mobile > div div.line-item-title")))]
    quantities =  [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table.shorten.hide-mobile > div div.line-item-bold.available")))]
    for i,j in zip(items,quantities):
      print(i, j)
    
  • Using XPATH and get_attribute("innerHTML"):

    driver.get('https://www.rrpcanada.org/#/')
    items =  [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='table shorten hide-mobile']/div//div[@class='line-item-title']")))]
    quantities =  [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='table shorten hide-mobile']/div//div[@class='line-item-bold available']")))]
    for i,j in zip(items,quantities):
      print(i, j)
    
  • Console output:

    Surgical & Reusable Masks  376,713,363 available
    Disposable Gloves  66,962,093 available
    Gowns and Coveralls  40,502,145 available
    Respirators  22,189,273 available
    Surface Wipes  20,650,831 available
    Face Shields  16,535,686 available
    Hand Sanitizer  11,152,890 L available
    Thermometers  8,457,993 available
    Testing Kits  2,110,815 available
    Surface Solutions  107,452 L available
    Protective Barriers  10,833 available
    Ventilators  410 available
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python


Answer 2 (score: 0)

You have to use a relative xpath to find the elements with class="line-item-left" for each item's name, and the elements with class="line-item-right" for the number of items available.

driver.find_elements_by_class_name("line-item-left")  # Item names
driver.find_elements_by_class_name("line-item-right")  # Number of items available

Note the "s" in find_elements: the plural form returns a list of all matching elements.
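
To illustrate what a relative lookup means without launching a browser, here is a sketch that uses Python's standard-library xml.etree.ElementTree on a hand-written snippet. The markup is an assumption modeled only on the class names mentioned in the answers, not the site's real HTML, and ElementTree is a stand-in for Selenium here. The analogous idea in Selenium would be to call the find method on each row element rather than on driver.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only the class names come from the answers.
SNIPPET = """
<div class="table">
  <a href="#/masks"><div>
    <div class="line-item-left">Masks</div>
    <div class="line-item-right">10 available</div>
  </div></a>
  <a href="#/gloves"><div>
    <div class="line-item-left">Gloves</div>
    <div class="line-item-right">20 available</div>
  </div></a>
</div>
"""

root = ET.fromstring(SNIPPET)

pairs = []
for row in root.findall("./a/div"):
    # The leading "." makes the search relative to the current row,
    # so each name is paired with the quantity from the same row.
    name = row.find(".//div[@class='line-item-left']")
    quantity = row.find(".//div[@class='line-item-right']")
    pairs.append((name.text, quantity.text))

print(pairs)
```

Searching from each row keeps a name and its quantity together even if some other part of the page reuses the same class names.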

Answer 3 (score: 0)

Here is the selector for the product name:

div.critical-product-table-container div.line-item-left

And for the total:

div.critical-product-table-container div.line-item-right

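The approach below does not need BeautifulSoup.

time.sleep(...) is bad practice; use WebDriverWait instead.

To combine the two result lists and loop over them in parallel, I use the zip() function: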

url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
wait = WebDriverWait(driver, 150)
product_names = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.critical-product-table-container div.line-item-left')))
totals = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.critical-product-table-container div.line-item-right')))

for product_name, total in zip(product_names, totals):
    print(product_name.text +'--' +total.text)
    
driver.quit()

You need the following imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
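
One caveat with zip(): it stops at the shorter input, so if the two locators ever return different numbers of elements, the trailing rows are dropped silently. Below is a minimal sketch of a guard, using plain strings in place of the WebElement .text values; the helper name pair_rows is hypothetical.

```python
def pair_rows(names, totals):
    """Pair names with totals, refusing to proceed if the lengths differ."""
    if len(names) != len(totals):
        raise ValueError(
            f"got {len(names)} names but {len(totals)} totals; "
            "the two selectors probably matched different rows"
        )
    return list(zip(names, totals))


# Plain strings standing in for product_name.text and total.text.
names = ["Respirators", "Ventilators"]
totals = ["22,189,273 available", "410 available"]

for name, total in pair_rows(names, totals):
    print(name + "--" + total)
```

Failing loudly on a length mismatch is a design choice; padding with itertools.zip_longest is an alternative when partial rows are acceptable.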