For some reason, my web scraper can't grab the product dimensions. HTML:
<div class="woodmart-tab-wrapper">
<a href="#tab-additional_information" class="woodmart-accordion-title tab-title-additional_information">Additional Information</a>
<div class="woocommerce-Tabs-panel woocommerce-Tabs-panel--additional_information panel entry-content wc-tab" id="tab-additional_information">
<div class="wc-tab-inner ">
<div class="">
<table class="shop_attributes">
<tr>
<th>Size</th>
<td class="product_dimensions">32 x 24 x 10 cm</td>
</tr>
I want to grab "32 x 24 x 10 cm". My code: I tried scraping it via CSS selectors, a relative XPath, and an absolute XPath, and nothing seems to work.
dimensions = ''
try:
    dimensions = driver.find_element_by_css_selector(
        '.product_dimensions').text
except Exception as e:
    dimensions = '-'
and
dimensions = ''
try:
    dimensions = driver.find_element_by_xpath(
        "//td[contains(@class,'product_dimensions')]").text
except Exception as e:
    dimensions = '-'
When the product has no dimensions, the output is:
dimensions: -
But when the product does have dimensions, the output is just:
dimensions:
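Answer 0 (score: 2)
You need to click the "Additional Information" ("Zusätzliche Informationen") tab before you can read the element's value. The cell sits in a collapsed panel, and Selenium's .text returns only text that is visibly rendered, which is why you get an empty string instead of an exception.
Using a CSS selector: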
from selenium import webdriver
url = 'https://designerparadies.de/produkt/schultertasche-trunk-aus-leder/'
d = webdriver.Chrome()
d.get(url)
d.find_element_by_css_selector('[href*=additional_information]').click()
print(d.find_element_by_css_selector('.product_dimensions').text)
d.quit()
Using XPath:
d.find_element_by_xpath("//*[contains(@class, 'additional_information_tab')]").click()
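If clicking the tab is inconvenient, a possible alternative is to read the cell's DOM textContent, which is filled in even while the panel is collapsed. The sketch below is only an illustration under that assumption: element_text is a hypothetical helper (not part of Selenium), and FakeHiddenCell is a made-up stand-in for a real WebElement so the snippet runs without a browser.

```python
def element_text(element):
    """Return an element's text even when the element is not displayed."""
    # .text only reports text that is visibly rendered on the page.
    visible = (element.text or "").strip()
    if visible:
        return visible
    # textContent is a DOM property, so it is populated for hidden nodes too.
    raw = element.get_attribute("textContent") or ""
    # Collapse runs of whitespace left over from the HTML source.
    return " ".join(raw.split())


class FakeHiddenCell:
    """Hypothetical stand-in for a WebElement inside a collapsed tab."""

    text = ""  # what Selenium reports for an element that is not displayed

    def get_attribute(self, name):
        return " 32 x 24 x 10 cm " if name == "textContent" else None


print(element_text(FakeHiddenCell()))  # prints: 32 x 24 x 10 cm
```

With a real driver you would pass the located element instead, for example element_text(d.find_element_by_css_selector('.product_dimensions')). Newer Selenium releases spell that lookup d.find_element(By.CSS_SELECTOR, '.product_dimensions').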
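Answer 1 (score: 0)
As far as I can see, you are using Selenium. Is there a reason you aren't using bs4 (Beautiful Soup) or some other web scraping module?
If you need to get past some kind of JavaScript challenge or something similar, I strongly recommend staying with Selenium.
In my experience, whenever I need to do some web scraping for a personal project, I usually find Beautiful Soup easier to use and better documented than Selenium.
Here is a sample program that fits your requirements: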
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
options = Options()
# Use --headless in order to hide the browser window
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
# Get the page and obtain its source
driver.get("http://example.com/woocom")
source = driver.page_source
# The source is a plain string now, so the browser can be closed
driver.quit()
# Use BeautifulSoup to create an object which contains
# every element in the webpage
page_object = BeautifulSoup(source, features="html.parser")
# If there is more than one td with the "product_dimensions" class, we want to
# get all of them and then loop over them to get their text
dimensions = []
product_dimensions = page_object.find_all("td", class_="product_dimensions")
for element in product_dimensions:
    dimensions.append(element.get_text())
# If there is only one td with the "product_dimensions" class, then use "find" instead
# of "find_all" (note that find returns None when nothing matches)
product_dimensions = page_object.find("td", class_="product_dimensions").get_text()
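If you don't need to get past any JavaScript or similar scripts, just replace driver.get("http://example.com/woocom") with requests.get("http://example.com/woocom"). Remember to import the requests library, and take the page source from the .text attribute of the response that requests.get() returns instead of from source = driver.page_source, which you no longer need.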
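As a rough sketch of the browser-free route, the snippet below fetches a page with requests and pulls out every td with the product_dimensions class. To keep it runnable without extra installs, the parsing half swaps in the standard library's html.parser in place of Beautiful Soup. The names DimensionParser, extract_dimensions and fetch_html are hypothetical, and the approach assumes the dimensions are present in the HTML the server sends rather than injected by JavaScript.

```python
from html.parser import HTMLParser


class DimensionParser(HTMLParser):
    """Collect the text of every <td class="product_dimensions"> cell."""

    def __init__(self):
        super().__init__()
        self.dimensions = []
        self._buffer = None  # text chunks while inside a matching <td>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "product_dimensions" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            # Join the chunks and collapse whitespace from the markup.
            self.dimensions.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def extract_dimensions(html):
    """Return the text of all product_dimensions cells found in html."""
    parser = DimensionParser()
    parser.feed(html)
    parser.close()
    return parser.dimensions


def fetch_html(url):
    """Download a page; requests is imported lazily so parsing works without it."""
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # requests.get() returns a Response object; the page source is on .text.
    return response.text


if __name__ == "__main__":
    sample = (
        '<table class="shop_attributes"><tr><th>Size</th>'
        '<td class="product_dimensions">32 x 24 x 10 cm</td></tr></table>'
    )
    print(extract_dimensions(sample))
```

For a live page you would call extract_dimensions(fetch_html(url)). The cell is found even though its tab is collapsed in the browser, because the parser works on the markup and never checks visibility.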
I hope this helps. That said, when asking a question, please try to provide as much information as possible so that others can answer you.