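Extracting information from parent and child divs

Date: 2018-03-12 16:59:06

Tags: python beautifulsoup

Using Python and BeautifulSoup, I need help extracting information from both a parent div and a child div.

Here is the first sample of markup: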

<div id="slide-609becd056bb40a7ad42607a4d1c67f5" 
class="slide has-link slick-slide" 
data-label="April 2 2018 Acura TLX Offer 2000x700.jpg" 
data-link="/new-inventory/index.htm?model=TLX&amp;year=2018" data-target="_self" 
style="background-image: url(&quot;https://pictures.dealer.com/a/adw/0877/5eabcb338dc604c09b28a4df5a49ad78x.jpg?impolicy=resize&amp;h=514&quot;); 
width: 1897px; position: relative; left: 0px; top: 0px; z-index: 998; opacity: 0; height: 514px; transition: opacity 750ms ease;" data-slick-index="0" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide00">

Here is the second sample of markup:

<div id="slide-7ae8b29ddc9e45d1a219beffe5793b2b"
class="html-slide slide slick-slide" 
data-label="March-Madness.jpg" data-link="" data-target="" 
data-promo-id="" data-slick-index="2" aria-hidden="true" tabindex="-1" role="option" 
aria-describedby="slick-slide02" 
style="width: 1897px; position: relative; left: -3794px; top: 0px; z-index: 998; opacity: 0; height: 514px; transition: opacity 750ms ease;">
    <div class="slide-background" 
    style="background-image: linear-gradient(rgba(0, 0, 0, 0), rgba(0, 0, 0, 0)), url(&quot;https://pictures.dealer.com/g/goodsonacuraofdallasadw/1747/13ed067a023df8ad412feea2c6eddec9x.jpg?impolicy=resize&amp;h=514&quot;); height: 514px;">
        <img src="https://pictures.dealer.com/g/goodsonacuraofdallasadw/1747/13ed067a023df8ad412feea2c6eddec9x.jpg?impolicy=resize&amp;h=514" class="placeholder-image pull-left">                                                                  </div>

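I need to get the style attribute from both samples so that I can extract the background image URL. The problem is that in the first sample the style is on the parent div, while in the second sample it is on a child div. How can I get both style attributes with Python and BeautifulSoup?

Here is the code I tried: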

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.goodsonacura.com/'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
banner_info = page_soup.findAll('div',{'class':['slide has-link', 'html-slide slide has-link']})
picture = [banner.get('style') for banner in banner_info]

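This code returns the correct style attribute for the first sample, but the wrong one for the second sample.

1 answer:

Answer 0 (score: 0)

Add the "slide-background" class to the find_all query. See the following example: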

banner_info = page_soup.find_all('div',{'class':['slide has-link', 'html-slide slide has-link', 'slide-background']})

It worked for me. Hope this helps.
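
Note that the widened query returns the outer slides and the inner "slide-background" divs in one flat list, so the styles still have to be sorted out afterwards. An alternative, sketched below, is to select each slide once and look for a `url(...)` value first in the slide's own style and then in any nested "slide-background" div. This is only a minimal sketch: `SAMPLE_HTML` is a trimmed copy of the two samples from the question, and the names `URL_PATTERN` and `background_url` are made up for illustration. It also assumes the style attributes are present in the HTML that gets parsed; if the page adds them with JavaScript, the HTML returned by `urlopen` may not contain them.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down copies of the two slides from the question (illustrative only).
SAMPLE_HTML = """
<div id="slide-609becd056bb40a7ad42607a4d1c67f5" class="slide has-link slick-slide"
     style="background-image: url(&quot;https://pictures.dealer.com/a/adw/0877/5eabcb338dc604c09b28a4df5a49ad78x.jpg?impolicy=resize&amp;h=514&quot;); height: 514px;">
</div>
<div id="slide-7ae8b29ddc9e45d1a219beffe5793b2b" class="html-slide slide slick-slide"
     style="width: 1897px; height: 514px;">
    <div class="slide-background"
         style="background-image: linear-gradient(rgba(0, 0, 0, 0), rgba(0, 0, 0, 0)), url(&quot;https://pictures.dealer.com/g/goodsonacuraofdallasadw/1747/13ed067a023df8ad412feea2c6eddec9x.jpg?impolicy=resize&amp;h=514&quot;); height: 514px;">
    </div>
</div>
"""

# Captures the address inside url(...), with or without surrounding quotes.
URL_PATTERN = re.compile(r'''url\(["']?([^"')]+)["']?\)''')


def background_url(slide):
    """Return the first background URL found on the slide or its slide-background children."""
    # Check the slide's own style first, then any nested slide-background div.
    candidates = [slide] + slide.find_all("div", class_="slide-background")
    for tag in candidates:
        match = URL_PATTERN.search(tag.get("style", ""))
        if match:
            return match.group(1)
    return None


page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_="slide" matches any div whose class list contains the token "slide",
# so both slide variants are found, but not the "slide-background" divs.
slides = page_soup.find_all("div", class_="slide")
urls = [background_url(slide) for slide in slides]

for url in urls:
    print(url)
```

Matching on the single class token "slide" avoids depending on the exact class string, which differs between the two samples. Against the real page, `SAMPLE_HTML` would be replaced by the downloaded `page_html`.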