Question

Somehow my web scraper cannot grab the product dimensions. HTML:

<div class="woodmart-tab-wrapper">
    <a href="#tab-additional_information" class="woodmart-accordion-title tab-title-additional_information">Additional Information</a>
    <div class="woocommerce-Tabs-panel woocommerce-Tabs-panel--additional_information panel entry-content wc-tab" id="tab-additional_information">
    <div class="wc-tab-inner ">
    <div class="">
    <table class="shop_attributes">
    <tr>
    <th>Size</th>
    <td class="product_dimensions">32 x 24 x 10 cm</td>
    </tr>

I want to grab "32 x 24 x 10 cm". My code: I tried scraping via CSS selectors, relative XPath, and absolute XPath, and nothing seems to work.

dimensions = ''
try:
    dimensions = driver.find_element_by_css_selector(
        '.product_dimensions').text
except Exception as e:
    dimensions = '-'

and

dimensions = ''
try:
    dimensions = driver.find_element_by_xpath(
        "//td[contains(@class,'product_dimensions')]").text
except Exception as e:
    dimensions = '-'

When a product has no dimensions, the output is:

dimensions: -

But when the product does have dimensions, the output is just:

dimensions:

Answer 1

You need to click the Additional Information (Zusätzliche Informationen) tab to access the element's value.

Using a CSS selector:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://designerparadies.de/produkt/schultertasche-trunk-aus-leder/'
d = webdriver.Chrome()
d.get(url)
# Open the "Additional information" tab so the table cell becomes visible
d.find_element(By.CSS_SELECTOR, '[href*=additional_information]').click()
print(d.find_element(By.CSS_SELECTOR, '.product_dimensions').text)
d.quit()

Using XPath:

d.find_element(By.XPATH, "//*[contains(@class, 'additional_information_tab')]").click()

[Screenshot: the "Additional information" tab]
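A likely reason for the empty string, and a possible alternative to clicking: Selenium's `.text` property returns only text that is rendered as visible, so a cell inside a collapsed tab panel comes back as `''`, while the DOM's `textContent` is still populated. The sketch below is only an illustration. The helper name `read_text` is hypothetical, and `FakeElement` is a stand-in so the snippet runs without a browser; with a real driver you would pass the `WebElement` returned by `find_element`.

```python
def read_text(element):
    """Return the element's visible text, or its raw textContent if hidden."""
    visible = (element.text or "").strip()
    if visible:
        return visible
    # Hidden elements report "" for .text; textContent ignores visibility.
    raw = element.get_attribute("textContent") or ""
    return raw.strip()


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement in a collapsed tab."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


hidden_cell = FakeElement(text="", text_content="32 x 24 x 10 cm")
print(read_text(hidden_cell))
```

Falling back to `textContent` avoids depending on which tab is currently open, at the cost of also returning text the user cannot see.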

Answer 2

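As far as I can see, you are using Selenium. Is there a reason not to use bs4 (Beautiful Soup) or another web scraping module?

If you need to get past some kind of JavaScript challenge or something similar, I strongly suggest that you:

Use Selenium to load the page
Use the Beautiful Soup module to extract the information you need

In my experience, whenever I need to do some web scraping for a personal project, I usually find Beautiful Soup easier to use and better documented than Selenium.

Here is an example program that fits your requirements: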

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

options = Options()
# Use --headless in order to hide the browser window
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)

# Get the page, obtain its source, then close the browser
driver.get("http://example.com/woocom")
source = driver.page_source
driver.quit()

# Use BeautifulSoup to create an object which contains
# every element in the webpage
page_object = BeautifulSoup(source, features="html.parser")

# If there is more than one td with the "product_dimensions" class, we want to
# get all of them and then loop over them to get their text
dimensions = []
product_dimensions = page_object.find_all("td", class_="product_dimensions")
for element in product_dimensions:
    dimensions.append(element.get_text())

# If there is only one td with the "product_dimensions" class, then use "find" instead
# of "find_all"
product_dimensions = page_object.find("td", class_="product_dimensions").get_text()

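If you don't need to get past any JavaScript or similar scripts, simply replace driver.get("http://example.com/woocom") with requests.get("http://example.com/woocom"). Remember to import the requests library and to drop source = driver.page_source, since you no longer need it: requests.get() returns a response object whose .text attribute holds the page source, so pass that to BeautifulSoup instead.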
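The extraction step does not depend on how the HTML was fetched: a parser reads the page source, so the cell should be readable even while its tab is collapsed, assuming the table is present in the served HTML as the snippet in the question suggests. The sketch below shows the same idea using only the standard library's `html.parser` instead of bs4 (a swapped-in technique, chosen so it runs without third-party packages). It is applied to a shortened copy of the HTML from the question, and the class name `DimensionParser` is hypothetical.

```python
from html.parser import HTMLParser

# Shortened copy of the markup shown in the question
SNIPPET = """
<table class="shop_attributes">
  <tr>
    <th>Size</th>
    <td class="product_dimensions">32 x 24 x 10 cm</td>
  </tr>
</table>
"""


class DimensionParser(HTMLParser):
    """Collect the text of every <td class="product_dimensions"> cell."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.dimensions = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "product_dimensions" in classes:
            self.inside = True
            self.dimensions.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.dimensions[-1] += data


parser = DimensionParser()
parser.feed(SNIPPET)
dimensions = [d.strip() for d in parser.dimensions]
print(dimensions)
```

With Beautiful Soup installed, the `find_all("td", class_="product_dimensions")` call from the example program does the same job in one line.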

I hope this helps. When asking a question, though, please try to provide as much information as possible to help others answer you.

Problem scraping product dimensions from a WooCommerce web shop
