How to get the contents of the first `<li>` from a `<ul>` using BeautifulSoup

Time: 2017-06-07 03:33:43

Tags: python html beautifulsoup

The HTML is as follows:

<div class="carousel"> 
  <div class="carousel_Wrapper"> 
    <div class="carousel_Container swiper-container"> 
      <ul class="swiper-wrapper">
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0001.jpg"/></figure>
        </li>
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0002.jpg"/></figure>
        </li>
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0003.jpg"/></figure>
        </li>
      </ul>
    </div>
    <div class="carousel_NextBtn"></div> 
    <div class="carousel_PrevBtn"></div> 
  </div> 
</div>

<div class="carousel"> 
  <div class="carousel_Wrapper"> 
    <div class="carousel_Container swiper-container"> 
      <ul class="swiper-wrapper">
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0004.jpg"/></figure>
        </li>
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0005.jpg"/></figure>
        </li>
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0006.jpg"/></figure>
        </li>
      </ul>
    </div>
    <div class="carousel_NextBtn"></div> 
    <div class="carousel_PrevBtn"></div> 
  </div> 
</div>

I want to use BeautifulSoup to turn it into the following HTML.

<figure><img alt="" src="https://s3.amazonaws.com/0001.jpg"/></figure>
<p><a href="https://xxxx.jp">other photos</a></p>

<figure><img alt="" src="https://s3.amazonaws.com/0004.jpg"/></figure>
<p><a href="https://xxxx.jp">other photos</a></p>

I am thinking of removing the unnecessary parts in the following way.
Since there may be other <div>s, I select by class and call decompose() and unwrap().

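from bs4 import BeautifulSoup

html = "..."  # placeholder: the HTML shown at the top of the question

content = BeautifulSoup(html, 'html.parser')

# find() returns only the first match, so use find_all() to process both carousels
for tag in content.find_all('div', class_='carousel_NextBtn'):
    tag.decompose()  # remove the button entirely
for class_name in ('carousel', 'carousel_Wrapper', 'carousel_Container swiper-container'):
    for tag in content.find_all('div', class_=class_name):
        tag.unwrap()  # drop the wrapper but keep its children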

After applying the processing above, I expect HTML like the following to be produced.

<ul class="swiper-wrapper">
  <li class="swiper-slide"> 
    <figure><img alt="" src="https://s3.amazonaws.com/0001.jpg"/></figure>
  </li>
  <li class="swiper-slide"> 
    <figure><img alt="" src="https://s3.amazonaws.com/0002.jpg"/></figure>
  </li>
  <li class="swiper-slide"> 
    <figure><img alt="" src="https://s3.amazonaws.com/0003.jpg"/></figure>
  </li>
</ul>
<div class="carousel_PrevBtn"></div> 

<ul class="swiper-wrapper">
  <li class="swiper-slide"> 
    <figure><img alt="" src="https://s3.amazonaws.com/0004.jpg"/></figure>
  </li>
  <li class="swiper-slide">
    <figure><img alt="" src="https://s3.amazonaws.com/0005.jpg"/></figure>
  </li>
  <li class="swiper-slide">
    <figure><img alt="" src="https://s3.amazonaws.com/0006.jpg"/></figure>
  </li>
</ul>
<div class="carousel_PrevBtn"></div> 
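
For reference, here is a minimal sketch of how decompose() and unwrap() differ. The one-line fragment is a hypothetical, cut-down stand-in for the markup above, reusing only its class names.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down fragment that reuses the class names from the question
snippet = '<div class="carousel"><ul><li>a</li></ul><div class="carousel_NextBtn"></div></div>'
soup = BeautifulSoup(snippet, 'html.parser')

# decompose() deletes the tag together with everything inside it
soup.find('div', class_='carousel_NextBtn').decompose()

# unwrap() deletes only the tag itself and leaves its children in place
soup.find('div', class_='carousel').unwrap()

result = str(soup)
print(result)  # <ul><li>a</li></ul>
```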

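I think the necessary processing is as follows.

  • 1. Retrieve the contents of the first <li> element of each <ul>
  • 2. Insert <p><a href="https://xxxx.jp">other photos</a></p>

    For 2, I think a simple replacement will do, but I don't know how to implement 1.

    Please explain how to solve this problem.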

1 answer:

Answer 0: (score: 0)

html = """<div class="carousel"> 
  <div class="carousel_Wrapper"> 
    <div class="carousel_Container swiper-container"> 
      <ul class="swiper-wrapper">
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0001.jpg"/></figure>
        </li>
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0002.jpg"/></figure>
        </li>
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0003.jpg"/></figure>
        </li>
      </ul>
    </div>
    <div class="carousel_NextBtn"></div> 
    <div class="carousel_PrevBtn"></div> 
  </div> 
</div>

<div class="carousel"> 
  <div class="carousel_Wrapper"> 
    <div class="carousel_Container swiper-container"> 
      <ul class="swiper-wrapper">
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0004.jpg"/></figure>
        </li>
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0005.jpg"/></figure>
        </li>
        <li class="swiper-slide"> 
          <figure><img alt="" src="https://s3.amazonaws.com/0006.jpg"/></figure>
        </li>
      </ul>
    </div>
    <div class="carousel_NextBtn"></div> 
    <div class="carousel_PrevBtn"></div> 
  </div> 
</div>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
all_ul = soup.find_all('ul', {'class': 'swiper-wrapper'})  # find every <ul> tag with the specified class
for i, tag in enumerate(all_ul):
    print('-------------------- iteration : ' + str(i) + ' --------------------')
    print(tag.find('li', {'class': 'swiper-slide'}))  # works only if the <li> has this class
    print(tag.contents[1])  # also works without a class; contents[0] is the whitespace before the first <li>
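
The index 1 in `contents[1]` depends on the whitespace between `<ul>` and the first `<li>`. Here is a small sketch of why, with `find('li', recursive=False)` as one whitespace-independent alternative. The fragment is a hypothetical, shortened list and is not taken from the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical shortened list: a newline sits between <ul> and the first <li>
snippet = '<ul>\n<li>first</li>\n<li>second</li>\n</ul>'
ul = BeautifulSoup(snippet, 'html.parser').ul

first_child = ul.contents[0]              # the "\n" text node, not a tag
by_index = ul.contents[1]                 # the first <li>, but only because of that "\n"
by_find = ul.find('li', recursive=False)  # first direct <li> child, regardless of whitespace

print(repr(first_child))
print(by_index is by_find)
```

If the markup were minified (no newline after `<ul>`), `contents[1]` would be the second `<li>`, while the `find()` call would still return the first one.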

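You can implement "retrieve the contents of the first <li> element of each <ul>" as shown in the code above. You did not have trouble with the second step, so I have not posted it. Let me know if you need any help.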
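
For completeness, here is one possible sketch that combines both steps and produces the output the question asks for. It is not part of the original answer. It assumes every carousel contains at least one `<li class="swiper-slide">` with a `<figure>` inside, and it uses a shortened copy of the question's markup (two slides per carousel instead of three).

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: same structure, fewer slides)
html = """
<div class="carousel"><div class="carousel_Wrapper">
  <div class="carousel_Container swiper-container"><ul class="swiper-wrapper">
    <li class="swiper-slide"><figure><img alt="" src="https://s3.amazonaws.com/0001.jpg"/></figure></li>
    <li class="swiper-slide"><figure><img alt="" src="https://s3.amazonaws.com/0002.jpg"/></figure></li>
  </ul></div>
  <div class="carousel_NextBtn"></div><div class="carousel_PrevBtn"></div>
</div></div>
<div class="carousel"><div class="carousel_Wrapper">
  <div class="carousel_Container swiper-container"><ul class="swiper-wrapper">
    <li class="swiper-slide"><figure><img alt="" src="https://s3.amazonaws.com/0004.jpg"/></figure></li>
    <li class="swiper-slide"><figure><img alt="" src="https://s3.amazonaws.com/0005.jpg"/></figure></li>
  </ul></div>
  <div class="carousel_NextBtn"></div><div class="carousel_PrevBtn"></div>
</div></div>
"""
soup = BeautifulSoup(html, 'html.parser')

for carousel in soup.find_all('div', class_='carousel'):
    # Step 1: detach the <figure> of the first slide from the tree
    figure = carousel.find('li', class_='swiper-slide').figure.extract()

    # Step 2: build <p><a href="https://xxxx.jp">other photos</a></p>
    p = soup.new_tag('p')
    a = soup.new_tag('a', href='https://xxxx.jp')
    a.string = 'other photos'
    p.append(a)

    # Swap the whole carousel for the figure, then put the link right after it
    carousel.replace_with(figure)
    figure.insert_after(p)

result = str(soup)
print(result)
```

Replacing the outer `<div class="carousel">` in one go avoids unwrapping each wrapper level separately. Whether that suits a real page depends on what else lives inside the carousel.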