Problem scraping product dimensions from a WooCommerce online shop

Asked: 2018-12-01 22:58:43

Tags: python selenium web-scraping

For some reason, my web scraper is unable to grab the product dimensions. HTML:

<div class="woodmart-tab-wrapper">
    <a href="#tab-additional_information" class="woodmart-accordion-title tab-title-additional_information">Additional Information</a>
    <div class="woocommerce-Tabs-panel woocommerce-Tabs-panel--additional_information panel entry-content wc-tab" id="tab-additional_information">
    <div class="wc-tab-inner ">
    <div class="">
    <table class="shop_attributes">
    <tr>
    <th>Size</th>
    <td class="product_dimensions">32 x 24 x 10 cm</td>
    </tr>

I want to grab "32 x 24 x 10 cm". My code: I tried scraping it via CSS selectors, a relative XPath, and an absolute XPath, and nothing seems to work.

dimensions = ''
try:
    dimensions = driver.find_element_by_css_selector(
        '.product_dimensions').text
except Exception as e:
    dimensions = '-'

dimensions = ''
try:
    dimensions = driver.find_element_by_xpath(
        "//td[contains(@class,'product_dimensions')]").text
except Exception as e:
    dimensions = '-'

When a product has no dimensions, the output is:

dimensions: -

But when the product does have dimensions, the output is just:

dimensions:

2 Answers:

Answer 0 (score: 2)

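You need to click the "Additional Information" (zusätzliche Information) tab to access the value of that element. Selenium's .text returns only text that is visible on the page, and the dimensions table sits inside a tab that stays collapsed until it is clicked, which is why you get an empty string rather than an exception.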

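Using a CSS selector:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://designerparadies.de/produkt/schultertasche-trunk-aus-leder/'
d = webdriver.Chrome()
d.get(url)
# find_element(By..., ...) works in Selenium 3 and 4; the older
# find_element_by_* helpers were removed in later Selenium 4 releases.
# Open the tab first so that its contents become visible.
d.find_element(By.CSS_SELECTOR, '[href*=additional_information]').click()
print(d.find_element(By.CSS_SELECTOR, '.product_dimensions').text)
d.quit()

Using XPath:

d.find_element(By.XPATH, "//*[contains(@class, 'additional_information_tab')]").click()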

[Screenshot: the "Additional Information" tab]
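If you would rather not click the tab, one alternative (not part of the answer above, so treat it as a sketch) is to fall back to the element's textContent property, which holds the node's text whether or not it is displayed. The helper name visible_or_hidden_text is made up for illustration; it only assumes an object with Selenium's WebElement interface (.text and .get_attribute).

```python
def visible_or_hidden_text(element):
    """Return the element's text, even when the element is not displayed.

    `element` is assumed to be a Selenium WebElement, or any object that
    exposes `.text` and `.get_attribute(name)` in the same way.
    """
    # .text only reports text that is currently rendered on screen
    text = (element.text or "").strip()
    if text:
        return text
    # textContent is the DOM property holding the node's text
    # regardless of visibility
    return (element.get_attribute("textContent") or "").strip()
```

Pass it the element located with the '.product_dimensions' selector; for a hidden cell it should return the cell's text instead of an empty string.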

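Answer 1 (score: 0)

I see you are using Selenium. Is there a reason not to use bs4 (Beautiful Soup) or some other web-scraping module?

If you need to get past some kind of JavaScript challenge or something similar, I strongly suggest that you:

  1. Use Selenium to load the page
  2. Get the HTML source
  3. Use the Beautiful Soup module to extract the information you need

In my experience, whenever I need to do some web scraping for a personal project, I usually find Beautiful Soup easier to use and better documented than Selenium.

Here is a sample program that fits your requirements: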

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

options = Options()
# Use --headless in order to hide the browser window
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)

# Get the page and obtain its source
driver.get("http://example.com/woocom")
source = driver.page_source
# The browser is no longer needed once we have the source
driver.quit()

# Use BeautifulSoup to create an object which contains
# every element in the webpage
page_object = BeautifulSoup(source, features="html.parser")

# If there is more than one td with the "product_dimensions" class, we want to
# get all of them and then loop over them to get their text
dimensions = []
product_dimensions = page_object.find_all("td", class_="product_dimensions")
for element in product_dimensions:
    dimensions.append(element.get_text())

# If there is only one td with the "product_dimensions" class, then use "find" instead
# of "find_all"
product_dimensions = page_object.find("td", class_="product_dimensions").get_text()

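If you don't need to get past any JavaScript or similar scripts, simply replace driver.get("http://example.com/woocom") with source = requests.get("http://example.com/woocom").text (remember to import the requests library and to drop the Selenium lines, including source = driver.page_source, since you no longer need them; requests.get() returns a Response object whose .text attribute holds the page source).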
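A minimal sketch of that requests-based variant, assuming the dimensions cell is already present in the HTML the server returns (I have not verified this for your shop). The function names parse_dimensions and fetch_dimensions are made up for illustration, and the sample markup is modelled on the snippet in the question.

```python
import requests
from bs4 import BeautifulSoup


def parse_dimensions(html):
    """Return the text of every td.product_dimensions cell found in `html`."""
    soup = BeautifulSoup(html, features="html.parser")
    cells = soup.find_all("td", class_="product_dimensions")
    return [cell.get_text(strip=True) for cell in cells]


def fetch_dimensions(url, timeout=10):
    """Download `url` with requests and parse it (no JavaScript is executed)."""
    response = requests.get(url, timeout=timeout)
    # Fail loudly on 4xx/5xx instead of silently parsing an error page
    response.raise_for_status()
    return parse_dimensions(response.text)


# Offline check against markup shaped like the snippet in the question
sample_html = """
<table class="shop_attributes">
  <tr>
    <th>Size</th>
    <td class="product_dimensions">32 x 24 x 10 cm</td>
  </tr>
</table>
"""
print(parse_dimensions(sample_html))  # ['32 x 24 x 10 cm']
```

Because the parser reads the raw markup, it does not matter that the tab is collapsed in the browser. If fetch_dimensions returns an empty list for a product that clearly shows dimensions, the table is probably injected by JavaScript and you will need the Selenium route after all.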

I hope this helps. When asking a question, though, please try to provide as much information as possible so that others can answer you.