How do I scrape product names from a website with Selenium?

Date: 2018-08-17 05:00:39

Tags: python selenium selenium-webdriver web-scraping webdriver

I am trying to scrape the following page: https://redmart.com/fresh-produce/fresh-vegetables. The problem I am facing is that it only returns some of the elements. The code I am using is below:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver

# Start the WebDriver and load the page
wd = webdriver.Chrome(executable_path=r"C:\Chrome\chromedriver.exe")
wd.get('https://redmart.com/fresh-produce/fresh-vegetables')

# Wait for the dynamically loaded elements to show up
WebDriverWait(wd, 300).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "productDescriptionAndPrice")))

# And grab the page HTML source
html_page = wd.page_source
wd.quit()

# Now you can use html_page as you like
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_page, 'lxml')
print(soup)

I need to use Selenium because the raw page source is useless: the page is generated by JavaScript. If you open the page, there are about 60 rows of products (roughly 360 products in total). Running this code only gives me 6 rows of products, stopping at the yellow onions.

Thanks!

3 Answers:

Answer 0 (score: 1)

As per your question, the product names on the website https://redmart.com/fresh-produce/fresh-vegetables can be scraped easily. As you mentioned, there are about 360 products in total, but only about 35 of them belong to that particular class. Here is my solution:

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    item_names = []
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_argument('disable-infobars')
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://redmart.com/fresh-produce/fresh-vegetables")
    titles = WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='productDescriptionAndPrice']//h4/a")))
    for title in titles:
        item_names.append(title.text)
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        titles = WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='productDescriptionAndPrice']//h4/a")))
        for title in titles:
            item_names.append(title.text)
    except Exception:
        pass
    for item_name in item_names:
        print(item_name)
    driver.quit()
    
  • Console output:

    Eco Leaf Baby Spinach Fresh Vegetable
    Eco Leaf Kale Fresh Vegetable
    Sustenir Agriculture Almighty Arugula
    Sustenir Fresh Toscano Black Kale
    Sustenir Fresh Kinky Green Curly Kale
    ThyGrace Honey Cherry Tomato
    Australian Broccoli
    Sustenir Agriculture Italian Basil
    GIVVO Japanese Cucumbers
    YUVVO Red Onions
    Australian Cauliflower
    YUVVO Spring Onion
    GIVVO Old Ginger
    GIVVO Cherry Grape Tomatoes
    YUVVO Holland Potato
    ThyGrace Traffic Light Capsicum Bell Peppers
    GIVVO Whole Garlic
    GIVVO Celery
    Eco Leaf Baby Spinach Fresh Vegetable
    Eco Leaf Kale Fresh Vegetable
    Sustenir Agriculture Almighty Arugula
    Sustenir Fresh Toscano Black Kale
    Sustenir Fresh Kinky Green Curly Kale
    ThyGrace Honey Cherry Tomato
    Australian Broccoli
    Sustenir Agriculture Italian Basil
    GIVVO Japanese Cucumbers
    YUVVO Red Onions
    Australian Cauliflower
    YUVVO Spring Onion
    GIVVO Old Ginger
    GIVVO Cherry Grape Tomatoes
    YUVVO Holland Potato
    ThyGrace Traffic Light Capsicum Bell Peppers
    GIVVO Whole Garlic
    GIVVO Celery
    

Note: You can construct a more robust XPATH or CSS-SELECTOR that covers more products and extract the relevant product names.
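As a hedged illustration of that note, the equivalent CSS selector can be tried offline with BeautifulSoup. The class name and `h4/a` structure are taken from the XPath in the code block above; the sample markup and product names here are invented for demonstration only:

```python
from bs4 import BeautifulSoup

# Minimal sample markup mirroring what the XPath above targets:
# div.productDescriptionAndPrice containing h4 > a with the product name.
sample_html = """
<div class="productDescriptionAndPrice">
  <h4><a href="/p/1">Australian Broccoli</a></h4>
</div>
<div class="productDescriptionAndPrice">
  <h4><a href="/p/2">GIVVO Celery</a></h4>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# CSS equivalent of //div[@class='productDescriptionAndPrice']//h4/a
names = [a.get_text(strip=True)
         for a in soup.select("div.productDescriptionAndPrice h4 a")]
print(names)  # ['Australian Broccoli', 'GIVVO Celery']
```

With Selenium itself, the same selector would be passed as `(By.CSS_SELECTOR, "div.productDescriptionAndPrice h4 a")`.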

Answer 1 (score: 0)

Here is some working code in Java. The test waits for more than 30 elements.

@Test
public void test1() {
    List<WebElement> found = new WebDriverWait(driver, 300).until(wd -> {
        List<WebElement> elements = driver.findElements(By.className("productDescriptionAndPrice"));
        if(elements.size() > 30)
            return elements;
        ((JavascriptExecutor) driver).executeScript("window.scrollTo(0, document.body.offsetHeight)");
        return null;
    });
    for (WebElement e : found) {
        System.out.println(e.getText());
    }
}
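The same idea — a custom wait condition that keeps scrolling until enough elements have loaded — can be sketched in Python too. This is a hypothetical helper, not part of either answer; the threshold of 30 mirrors the Java test above, and the condition is written so it can be passed to `WebDriverWait(...).until(...)`:

```python
def enough_products(locator, minimum):
    """Custom wait condition: return the matched elements once more than
    `minimum` are present; otherwise scroll down and signal 'not yet'."""
    def condition(driver):
        elements = driver.find_elements(*locator)
        if len(elements) > minimum:
            return elements
        # Not enough yet: trigger the lazy loader and let until() retry.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        return False
    return condition

# Usage with a real driver (sketch):
# found = WebDriverWait(driver, 300).until(
#     enough_products((By.CLASS_NAME, "productDescriptionAndPrice"), 30))
```

`until()` calls the condition repeatedly, so each retry scrolls a little further until the element count passes the threshold or the wait times out.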

Answer 2 (score: 0)

Hi DebanjanB, thanks for your help. I have been at this all day. The real problem is getting the complete product list into the page source. If everything is in the source, I think it can be extracted. I believe the source changes as you scroll down, which is probably why all of us can only extract 36 items.

With that in mind, my solution is below. It is not perfect, since I have to do further processing later to remove duplicates. If you have other ideas or can optimize it further, I would appreciate it.

The general idea is to scroll down, grab the source, and keep appending the overlapping sources into one big, long page source. I ended up with over 1400 products this way from a 360-product page, which is why I say it is a bad solution.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver
import time
from bs4 import BeautifulSoup

# Start the WebDriver and load the page
wd = webdriver.Chrome(executable_path=r"C:\Chrome\chromedriver.exe")
wd.delete_all_cookies()
wd.set_page_load_timeout(30)

wd.get('https://redmart.com/fresh-produce/fresh-vegetables#toggle=all')
time.sleep(5)

html_page = wd.page_source
soup = BeautifulSoup(html_page, 'lxml')

last_height = wd.execute_script("return document.body.scrollHeight")
while True:
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    html_page = wd.page_source
    soup2 = BeautifulSoup(html_page, 'lxml')

    # copy the children first: append() re-parents nodes, which would
    # skip elements if we iterated soup2.body directly
    for element in list(soup2.body.children):
        soup.body.append(element)
    time.sleep(2)

    #break condition
    new_height = wd.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
wd.quit()

results = soup.findAll('div', attrs={'class': 'productDescriptionAndPrice'})
print(len(results))
print(results[0])   # tallies with the first product
print(results[-1])  # tallies with the last

To be honest, I am quite disappointed with this solution. Thanks, and please keep the ideas coming!
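For the duplicate problem mentioned above, an order-preserving de-duplication pass over the extracted names is straightforward. This is only a sketch: in practice the input would be the texts pulled from the `results` list above, but here a few names from the console output stand in for them:

```python
def dedupe(names):
    """Drop repeated product names while preserving first-seen order."""
    seen = set()
    unique = []
    for name in names:
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique

scraped = ["GIVVO Celery", "YUVVO Red Onions", "GIVVO Celery"]
print(dedupe(scraped))  # ['GIVVO Celery', 'YUVVO Red Onions']
```

On Python 3.7+, `list(dict.fromkeys(names))` achieves the same thing in one line, since dicts preserve insertion order.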
