Decompose tags by their content/text

Time: 2019-05-09 13:36:31

Tags: beautifulsoup

I want to decompose the h4 tags whose text/content is photos or prototypes.

Here is the HTML code:

     <div class='wrap'>
       <div class='col'>
          <h4 class='h4'>photos</h4>
          <h4 class='h4'>videos</h4>
           <h4 class='h4'>prototypes</h4>
            <h4 class='h4'>weight</h4>
        </div>
      <div class='col'>
          <h4 class='h4'>color</h4>
           <h4 class='h4'>selfie</h4>
            <h4 class='h4'>front</h4>
             <h4 class='h4'>back</h4>
       </div>
       </div>

And this is the output I want:

<div class='wrap'>
       <div class='col'>
          <h4 class='h4'>videos</h4>
            <h4 class='h4'>weight</h4>
        </div>
      <div class='col'>
          <h4 class='h4'>color</h4>
           <h4 class='h4'>selfie</h4>
            <h4 class='h4'>front</h4>
             <h4 class='h4'>back</h4>
       </div>
       </div>

2 answers:

Answer 0 (score: 2)

You can pass a regular expression to the text argument of find_all (named string since Beautiful Soup 4.4.0), then decompose each matching tag.

html_doc="""
<div class='wrap'>
   <div class='col'>
      <h4 class='h4'>photos</h4>
      <h4 class='h4'>videos</h4>
       <h4 class='h4'>prototypes</h4>
        <h4 class='h4'>weight</h4>
    </div>
  <div class='col'>
      <h4 class='h4'>color</h4>
       <h4 class='h4'>selfie</h4>
        <h4 class='h4'>front</h4>
         <h4 class='h4'>back</h4>
   </div>
</div>
"""
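from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html_doc, 'html.parser')
# The pattern is applied as a search, so any h4 whose text contains either word matches.
for tag in soup.find_all('h4', text=re.compile('photos|prototypes')):
    tag.decompose()  # remove the tag and its contents from the tree
print(soup)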

Output:

<div class="wrap">
<div class="col">

<h4 class="h4">videos</h4>

<h4 class="h4">weight</h4>
</div>
<div class="col">
<h4 class="h4">color</h4>
<h4 class="h4">selfie</h4>
<h4 class="h4">front</h4>
<h4 class="h4">back</h4>
</div>
</div>
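
One caveat with the regex above: re.compile('photos|prototypes') is a substring match, so an h4 such as photoshop would also be removed. Below is a minimal sketch of an exact-match variant, assuming Beautiful Soup 4.4.0 or later (where the argument is named string). The photoshop heading is a made-up example, not part of the question.

```python
import re
from bs4 import BeautifulSoup

# Shortened markup; the 'photoshop' heading is hypothetical and only
# demonstrates the difference between substring and exact matching.
html_doc = """
<div class='col'>
  <h4 class='h4'>photos</h4>
  <h4 class='h4'>photoshop</h4>
  <h4 class='h4'>prototypes</h4>
  <h4 class='h4'>weight</h4>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Anchor the pattern so the whole text must be one of the two words
# (surrounding whitespace is tolerated).
exact = re.compile(r'^\s*(?:photos|prototypes)\s*$')
for tag in soup.find_all('h4', string=exact):
    tag.decompose()

remaining = [h4.get_text(strip=True) for h4 in soup.find_all('h4')]
print(remaining)
```

With the anchored pattern, photoshop survives while photos and prototypes are removed.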

Answer 1 (score: 1)

Use a Python lambda function to find each tag by its name and text, then call decompose() on it.

from bs4 import BeautifulSoup
data='''<div class='wrap'>
       <div class='col'>
          <h4 class='h4'>photos</h4>
          <h4 class='h4'>videos</h4>
           <h4 class='h4'>prototypes</h4>
            <h4 class='h4'>weight</h4>
        </div>
      <div class='col'>
          <h4 class='h4'>color</h4>
           <h4 class='h4'>selfie</h4>
            <h4 class='h4'>front</h4>
             <h4 class='h4'>back</h4>
       </div>
       </div>'''

soup = BeautifulSoup(data, 'html.parser')
# Match h4 tags whose text contains 'photos' or 'prototypes' (substring test).
for item in soup.find_all(lambda tag: tag.name == 'h4' and ('photos' in tag.text or 'prototypes' in tag.text)):
    item.decompose()

print(soup)

Output:

<div class="wrap">
<div class="col">

<h4 class="h4">videos</h4>

<h4 class="h4">weight</h4>
</div>
<div class="col">
<h4 class="h4">color</h4>
<h4 class="h4">selfie</h4>
<h4 class="h4">front</h4>
<h4 class="h4">back</h4>
</div>
</div>
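
If the list of unwanted headings grows, the lambda can be wrapped in a small helper. This is only a sketch: remove_tags_by_text is a hypothetical name, not part of Beautiful Soup, and it compares the stripped text exactly rather than by substring.

```python
from bs4 import BeautifulSoup


def remove_tags_by_text(soup, name, unwanted):
    """Decompose every `name` tag whose stripped text is in `unwanted`.

    Hypothetical helper for illustration; returns the number of tags removed.
    """
    unwanted = set(unwanted)
    # Collect the matches first, then remove them from the tree.
    matches = soup.find_all(
        lambda tag: tag.name == name and tag.get_text(strip=True) in unwanted
    )
    for tag in matches:
        tag.decompose()
    return len(matches)


data = '''<div class='wrap'>
  <div class='col'>
    <h4 class='h4'>photos</h4>
    <h4 class='h4'>videos</h4>
    <h4 class='h4'>prototypes</h4>
    <h4 class='h4'>weight</h4>
  </div>
  <div class='col'>
    <h4 class='h4'>color</h4>
    <h4 class='h4'>selfie</h4>
    <h4 class='h4'>front</h4>
    <h4 class='h4'>back</h4>
  </div>
</div>'''

soup = BeautifulSoup(data, 'html.parser')
removed = remove_tags_by_text(soup, 'h4', ['photos', 'prototypes'])
remaining = [h4.get_text(strip=True) for h4 in soup.find_all('h4')]
print(removed, remaining)
```

Using a set keeps the membership test cheap and makes it easy to add more headings later without changing the lambda.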