Unable to get all the links from a webpage

Asked: 2021-01-30 10:17:18

Tags: python selenium web-scraping python-requests-html

I am working on a web scraping project. The URL of the website I am scraping is https://www.beliani.de/sofas/ledersofa/

I am scraping all the product links listed on this page. I tried getting the links using both Requests-HTML and Selenium, but I got only 57 and 24 links respectively, even though more than 150 products are listed on the page. Below are the code blocks I am using.

Using Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep

options = Options()
options.add_argument("user-agent = Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36")

# path to the Chrome driver executable
DRIVER_PATH = 'C:/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)

url = 'https://www.beliani.de/sofas/ledersofa/'

driver.get(url)
sleep(20)

links = []
for a in driver.find_elements_by_xpath('//*[@id="offers_div"]/div/div/a'):
    print(a.get_attribute('href'))
    links.append(a.get_attribute('href'))
print(len(links))

Using Requests-HTML:

from requests_html import HTMLSession

url = 'https://www.beliani.de/sofas/ledersofa/'

s = HTMLSession()
r = s.get(url)

r.html.render(sleep=20)

products = r.html.xpath('//*[@id="offers_div"]', first=True)

# Getting 57 links using the block below:
links = []
for link in products.absolute_links:
    print(link)
    links.append(link)

print(len(links))
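
For reference, Requests-HTML's render() also accepts a scrolldown parameter that presses page-down the given number of times before the HTML is captured, which can trigger the same kind of lazy loading discussed in the answers. A minimal sketch of the idea (the scroll count and sleep values below are untested guesses):

from requests_html import HTMLSession

s = HTMLSession()
r = s.get('https://www.beliani.de/sofas/ledersofa/')

# scrolldown=n presses page-down n times, sleeping between scrolls,
# so lazy-loaded products get a chance to appear (values are untested guesses)
r.html.render(scrolldown=15, sleep=2)

products = r.html.xpath('//*[@id="offers_div"]', first=True)
print(len(products.absolute_links))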

I don't know which step I am doing wrong or what I am missing.

2 Answers:

Answer 0 (Score: 1):

You have to scroll through the website and reach the end of the page so that all of its content gets loaded. On just opening the site, only what is needed to display the currently visible section of the page is loaded (lazy loading). So when you run your code, it can only retrieve links from the content that has already been loaded.

This one gave me 160 links:

driver.get('https://www.beliani.de/sofas/ledersofa/')
sleep(3)

# get the total height of the document
height = driver.execute_script('return document.body.scrollHeight')

# now break the webpage into parts so that each section in the page is scrolled through to load
scroll_height = 0
for i in range(10):
    scroll_height = scroll_height + (height/10)
    driver.execute_script('window.scrollTo(0,arguments[0]);',scroll_height)
    sleep(2)

# once the loop has completed, locate the product anchors;
# I have used the 'class' locator, but you can use any locator you want
a_tags = driver.find_elements_by_class_name('itemBox')
count = 0
for i in a_tags:
    if i.get_attribute('href') is not None:
        print(i.get_attribute('href'))
        count += 1

print(count)
driver.quit()
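
If the number of products (and hence the page height) varies, an alternative to splitting the height into a fixed 10 steps is to keep scrolling to the bottom until document.body.scrollHeight stops growing. An untested sketch, assuming the same driver setup as above:

# scroll to the bottom repeatedly until the document height stops growing,
# i.e. until no more lazy-loaded products arrive
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height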

Answer 1 (Score: 1):

To extract the total number of links using Selenium, you need to accept the cookies and induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:

  • Using CSS_SELECTOR:

    driver.get("https://www.beliani.de/sofas/ledersofa/")
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[value='Akzeptieren']"))).click()
    print(len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#offers_div > div > div > a[href]")))))
    
  • Using XPATH:

    driver.get("https://www.beliani.de/sofas/ledersofa/")
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@value='Akzeptieren']"))).click()
    print(len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@id='offers_div']/div/div/a[@href]")))))
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
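
Putting the pieces together, a minimal end-to-end sketch that collects the hrefs rather than just counting them (assuming the chromedriver path from the question):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path='C:/chromedriver')
driver.get("https://www.beliani.de/sofas/ledersofa/")

# accept the cookie banner first, otherwise the overlay blocks the page
WebDriverWait(driver, 20).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, "input[value='Akzeptieren']"))).click()

# wait until every product anchor is visible, then collect the hrefs
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located(
    (By.CSS_SELECTOR, "div#offers_div > div > div > a[href]")))
links = [e.get_attribute('href') for e in elements]
print(len(links))
driver.quit()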