Question

I'm learning web scraping and trying to build a script with selenium & bs4 that scrapes data from AliExpress product pages. I'm using this example product: https://www.aliexpress.com/item/33046358386.html

I tried to scrape the product details with

details = soup.find("div", {"class": "product-detail-tab"})

but it only returns

<div class="product-detail-tab">
    <div class="lazyload-placeholder" style="height: 1000px;"></div>
</div>

even though I can see much more markup inside that div when I inspect the page in the browser.

I also tried locating the div this way, but the result was the same:

details = browser.find_elements_by_xpath('//div[@class="product-detail-tab"]')

My full scraping code:

import pandas
from bs4 import BeautifulSoup
from selenium import webdriver

product_id = "33046358386"

page_url = f"https://www.aliexpress.com/item/{product_id}.html"
browser = webdriver.Chrome(executable_path='/usr/bin/chromedriver')
browser.get(page_url)
page_html = browser.page_source
soup = BeautifulSoup(page_html, 'html.parser')

product_name = soup.find("h1", {"class": "product-title-text"}).text.strip()
product_price = soup.find("span", {"class": "product-price-value"}).text.strip()
shipping_price = soup.find("div", {"class": "product-shipping-price"}).span.text.strip()
details = soup.find("div", {"class": "product-detail-tab"})

print(product_name)
print(product_price)
print(shipping_price)
print(details)

browser.close()

I'd be glad to hear what the problem is here.

Answer 1

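Sorry, I can't comment because I don't have enough reputation. I'm still new to this, but I think I can point you in the right direction. Sri is right: this is caused by dynamic loading. If the content is loaded by JavaScript (which seems to be the case here), you can use Selenium with PhantomJS to get the rendered source. See the example below.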

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.aliexpress.com/item/33046358386.html"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
a = soup.find('section', 'wrapper')
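
The `lazyload-placeholder` div in the question suggests that the details section is only filled in once it scrolls into view, so it may also help to scroll before reading the source. Here is a minimal sketch under that assumption. The helper name `scroll_until_loaded`, its parameters, and the crude text check for the placeholder are all illustrative and not part of the original code. It relies only on the driver's `execute_script` method and `page_source` attribute, which Selenium drivers provide.

```python
import time


def scroll_until_loaded(driver, placeholder="lazyload-placeholder",
                        max_scrolls=10, pause=1.0):
    """Scroll down until `placeholder` no longer appears in the page source.

    Hypothetical helper: `driver` is any object exposing Selenium's
    `execute_script` method and `page_source` attribute. Returns the final
    page source, which may still contain the placeholder if `max_scrolls`
    is exhausted.
    """
    for _ in range(max_scrolls):
        if placeholder not in driver.page_source:
            break
        # Scroll to the bottom so lazily loaded sections come into view
        # (assumption: the site loads them on scroll).
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page's JavaScript time to run
    return driver.page_source
```

With the question's code this would be called as `page_html = scroll_until_loaded(browser)` before building the soup.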

Answer 2

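As Sri pointed out, you are parsing the HTML without giving the dynamically loaded elements time to load. (By the way, there appears to be an API call in the background, which may be a better option than driving the page with Selenium.) I think the right way to handle it is something like this: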

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
...

browser.get(page_url)
MAX_WAIT_TIME = 60
wait = WebDriverWait(browser, MAX_WAIT_TIME)
element = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.product-info")))
page_html = browser.page_source

If the wait fails for some reason, you may want to catch the TimeoutException, but that requires a bit more refactoring.
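
To make the timeout handling concrete, here is a simplified stand-in for the polling that `WebDriverWait.until` performs. It is a sketch, not Selenium's implementation: `wait_until`, `fetch_or_none`, their parameters, and the use of the built-in `TimeoutError` are assumptions chosen so the example runs without a browser. With Selenium itself you would catch `selenium.common.exceptions.TimeoutException` around `wait.until(...)` instead.

```python
import time


def wait_until(condition, timeout=60.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Raises the built-in TimeoutError if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # pause between checks, like WebDriverWait's polling


def fetch_or_none(condition, timeout=60.0, poll=0.5):
    """Return the condition's result, or None if the wait times out."""
    try:
        return wait_until(condition, timeout=timeout, poll=poll)
    except TimeoutError:
        # Decide here how the scraper should degrade: skip the product,
        # retry, or log the failure.
        return None
```

Returning `None` on timeout lets the caller skip a product whose details never load instead of crashing the whole run.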

Unable to access div content when web scraping with Selenium

2 answers: