Removing multiple unwanted tags with BeautifulSoup decompose()

Time: 2017-10-27 19:40:17

Tags: python python-3.x web-scraping beautifulsoup

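I'm trying to clean up some HTML so that all I have left is the relevant text I want. The code below uses the first function defined to strip out the superscript tags before calling .get_text(). I now want to do the same thing, but for the 'h4', 'h1', 'a' and 'li' tags.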

import requests
from bs4 import BeautifulSoup
url = "https://www.biblegateway.com/passage/?search=Luke+14%3A12-14&version=NIV"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

def supClean(verseWithSup):
    for sup in verseWithSup:
        verseWithSup.sup.decompose()
    return verseWithSup

def verseExtract(soup):
    verseName = soup.find(class_="passage-display-bcv").get_text()
    verseWithSup = soup.find(class_="passage-text")
    verseBody = supClean(verseWithSup).get_text()
    return verseName, verseBody

verseName, verseBody = (verseExtract(soup))

print(verseName)
print(verseBody)

This is what I get right now:

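  Luke 14:12-14 New International Version (NIV) Then Jesus said to his host, “When you give a luncheon or dinner, do not invite your friends, your brothers or sisters, your relatives, or your rich neighbors; if you do, they may invite you back and so you will be repaid. But when you give a banquet, invite the poor, the crippled, the lame, the blind, 14 and you will be blessed. Although they cannot repay you, you will be repaid at the resurrection of the righteous.” Cross references: Luke 14:13 : ver 21 Luke 14:14 : Ac 24:15

  New International Version (NIV) Holy Bible, New International Version®, NIV® Copyright ©1973, 1978, 1984, 2011 by Biblica, Inc.® Used by permission. All rights reserved worldwide.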

But this is what I want:

  

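  Then Jesus said to his host, “When you give a luncheon or dinner, do not invite your friends, your brothers or sisters, your relatives, or your rich neighbors; if you do, they may invite you back and so you will be repaid. But when you give a banquet, invite the poor, the crippled, the lame, the blind, 14 and you will be blessed. Although they cannot repay you, you will be repaid at the resurrection of the righteous.”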

This is what the HTML looks like:

<div class="passage-text">
 <div class="passage-wrap">
  <div class="passage-content passage-class-0">
   <div class="version-NIV result-text-style-normal text-html ">
    <h1 class="passage-display">
     <span class="passage-display-bcv">
      Luke 14:12-14
     </span>
     <span class="passage-display-version">
      New International Version (NIV)
     </span>
    </h1>
    <p>
     <span class="text Luke-14-12" id="en-NIV-25566">
      Then Jesus said to his host,
      <span class="woj">
       “When you give a luncheon or dinner, do not invite your friends, your brothers or sisters, your relatives, or your rich neighbors; if you do, they may invite you back and so you will be repaid.
      </span>
     </span>
     <span class="text Luke-14-13" id="en-NIV-25567">
      <span class="woj">
       But when you give a banquet, invite the poor, the crippled, the lame, the blind,
       <sup class="crossreference" data-cr="#cen-NIV-25567A" data-link='(&lt;a href="#cen-NIV-25567A" title="See cross-reference A"&gt;A&lt;/a&gt;)'>
       </sup>
      </span>
     </span>
     <span class="text Luke-14-14" id="en-NIV-25568">
      <span class="woj">
       <sup class="versenum">
        14
       </sup>
       and you will be blessed. Although they cannot repay you, you will be repaid at the resurrection of the righteous.”
       <sup class="crossreference" data-cr="#cen-NIV-25568B" data-link='(&lt;a href="#cen-NIV-25568B" title="See cross-reference B"&gt;B&lt;/a&gt;)'>
       </sup>
      </span>
     </span>
    </p>
    <div class="crossrefs hidden">
     <h4>
      Cross references:
     </h4>
     <ol>
      <li id="cen-NIV-25567A">
       <a href="#en-NIV-25567" title="Go to Luke 14:13">
        Luke 14:13
       </a>
       :
       <a class="crossref-link" data-bibleref="Luke 14:21" href="/passage/?search=Luke+14%3A21&amp;version=NIV">
        ver 21
       </a>
      </li>
      <li id="cen-NIV-25568B">
       <a href="#en-NIV-25568" title="Go to Luke 14:14">
        Luke 14:14
       </a>
       :
       <a class="crossref-link" data-bibleref="Acts 24:15" href="/passage/?search=Acts+24%3A15&amp;version=NIV">
        Ac 24:15
       </a>
      </li>
     </ol>
    </div>
    <!--end of crossrefs-->
   </div>
   <div class="publisher-info-bottom with-single">
    <strong>
     <a href="/versions/New-International-Version-NIV-Bible/">
      New International Version
     </a>
     (NIV)
    </strong>
    <p>
     Holy Bible, New International Version®, NIV® Copyright ©1973, 1978, 1984, 2011 by
     <a href="http://www.biblica.com/">
      Biblica, Inc.®
     </a>
     Used by permission. All rights reserved worldwide.
    </p>
   </div>
  </div>
 </div>
</div>

2 answers:

Answer 0 (score: 1)

Give this a try. Let me know if you also want to get rid of the verse numbers 12, 13 and 14.

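from bs4 import BeautifulSoup
import requests

link = "https://www.biblegateway.com/passage/?search=Luke+14%3A12-14&version=NIV"

# The "lxml" parser requires the lxml package to be installed.
soup = BeautifulSoup(requests.get(link).text, "lxml")
# Each verse lives in a span whose id starts with "en-NIV-".
for item in soup.select("[id^='en-NIV-']"):
    print(item.text.strip())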

Output:

12 Then Jesus said to his host, “When you give a luncheon or dinner, do not invite your friends, your brothers or sisters, your relatives, or your rich neighbors; if you do, they may invite you back and so you will be repaid.
13 But when you give a banquet, invite the poor, the crippled, the lame, the blind,
14 and you will be blessed. Although they cannot repay you, you will be repaid at the resurrection of the righteous.”

Or, to get rid of the verse numbers, you can try the following:

import requests
from lxml.html import fromstring

link = "https://www.biblegateway.com/passage/?search=Luke+14%3A12-14&version=NIV"
root = fromstring(requests.get(link).text)
for item in root.cssselect("[id^='en-NIV-'],.woj"):
    # Iterate over a copy of the children, since drop_tree() mutates the tree.
    for data in list(item):
        data.drop_tree()
    print(item.text_content())

Result:

Then Jesus said to his host, 
“When you give a luncheon or dinner, do not invite your friends, your brothers or sisters, your relatives, or your rich neighbors; if you do, they may invite you back and so you will be repaid.
But when you give a banquet, invite the poor, the crippled, the lame, the blind,
and you will be blessed. Although they cannot repay you, you will be repaid at the resurrection of the righteous.”
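
For anyone who would rather stay within bs4, here is a minimal sketch of the same idea: select each verse span by its id prefix, decompose the `<sup>` tags inside it, then read the text. The inline HTML is a trimmed-down stand-in for the markup in the question (an assumption; the live page may differ), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the question's markup (assumption: the real page may differ).
html = """
<p>
 <span class="text Luke-14-13" id="en-NIV-25567"><span class="woj"><sup class="versenum">13 </sup>But when you give a banquet, invite the poor,<sup class="crossreference"></sup></span></span>
 <span class="text Luke-14-14" id="en-NIV-25568"><span class="woj"><sup class="versenum">14 </sup>and you will be blessed.<sup class="crossreference"></sup></span></span>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

verses = []
for verse in soup.select("[id^='en-NIV-']"):
    # select() returns a list of matches, so removing tags inside the loop is safe.
    for sup in verse.select("sup"):
        sup.decompose()
    verses.append(verse.get_text(strip=True))

print("\n".join(verses))
```

Because the verse numbers and cross-reference markers all sit in `<sup>` tags, removing those tags first leaves only the verse text.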

Answer 1 (score: 0)

Although the answer above works well, I ended up with this code to do what I wanted with bs4.

for item in soup.select("sup, div.publisher-info-bottom.with-single"):
    item.decompose()

Then I used the following to get the text and format it the way I wanted.

verseBody = soup.select("div.passage-text p")
formedVerse = ""
for p in verseBody:
    formedVerse += p.get_text() + "\n\n"
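
Putting the two steps together, the approach could be sketched roughly as below for the 'h4', 'h1', 'a' and 'li' tags from the question. `strip_tags` is a hypothetical helper name, and the inline HTML is a simplified stand-in for the real page, so treat this as a sketch under those assumptions. It also avoids a pitfall in the question's `supClean`, which loops over the container's children while calling `.sup.decompose()`: the number of tags removed then depends on how many children the container has, and `.sup` becomes `None` once no `<sup>` is left.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the passage markup (assumption: not the real page).
html = """
<div class="passage-text">
 <h1>Luke 14:12-14</h1>
 <p>Then Jesus said to his host,<sup>12</sup> invite the poor.</p>
 <div class="crossrefs"><h4>Cross references:</h4><ol><li><a href="#x">Luke 14:13</a></li></ol></div>
 <p>You will be repaid.</p>
</div>
"""


def strip_tags(root, selector):
    """Hypothetical helper: decompose every element under root matching the CSS selector."""
    for tag in root.select(selector):
        # A nested match (such as <a> inside <li>) may already be gone with its parent.
        # The .decomposed property is available in recent Beautiful Soup 4 releases.
        if tag.decomposed:
            continue
        tag.decompose()
    return root


soup = BeautifulSoup(html, "html.parser")
passage = strip_tags(soup.find(class_="passage-text"), "sup, h1, h4, a, li")

# Join the remaining paragraphs with blank lines between them.
formed_verse = "\n\n".join(p.get_text() for p in passage.select("p"))
print(formed_verse)
```

Using `"\n\n".join(...)` instead of `+=` in a loop gives the same paragraph spacing without the trailing blank lines.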