BeautifulSoup - How to remove nested tags that contain no text, and trailing whitespace, from the end of HTML

Time: 2015-10-16 02:43:15

Tags: python beautifulsoup

I'm trying to use BeautifulSoup to remove tags that have no text in them. For example, I have the following markup:

<p>
   <p>
       <br/>
   </p>
</p>

<p>
   <br/>
</p>

I have the following function:

@staticmethod
def stripTagWithNoText(soup,tagname,**kwargs):
    """Strip tags with no text"""
    #Make sure that soup and tags were defined
    assert isinstance(tagname,str)

    #Remove tags with no text
    for tag in soup.find_all(tagname):
        if tag.string:
            continue
        for subtag in tag.findChildren():
            if subtag.string:
                break
        else:
            continue
        tag.extract()

However, this also removes the following tag:

<p>This is some random text</p>

Can anyone spot what is wrong with this?

Also, suppose the following is appended to the end of my HTML:

<p><br />
</p><p><br /> 
</p><p><br />
</p><p><br /> 
</p><p><br />
</p><p><br />
</p>

Is there some way to remove all of this blank content from the end of the HTML, similar to string_text.strip()?

Note: I'm using Python 3 and bs4.

1 Answer:

Answer 0 (score: 0)

Does this work for you?

from bs4 import BeautifulSoup
from bs4.element import Tag

def main():
    test = """
    <p>
    this should not be here
       <p>this should not be here
           <br/>this should not be here
       </p>
       this should not be here
    </p>
    """
    soup = BeautifulSoup(test, 'html.parser')

    def stripTagWithNoText(soup, tagname):
        def remove(node):
            # Iterate over a copy, because extract() mutates node.contents
            for item in list(node.contents):
                if isinstance(item, Tag):
                    remove(item)
                else:
                    # Detach the text node from the tree
                    item.extract()

        # Remove every text node inside each matching tag
        for tag in soup.find_all(tagname):
            remove(tag)
        print(soup)

    stripTagWithNoText(soup, 'p')
    return 0

if __name__ == '__main__':
    main()
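
As far as I can tell, the function in the question misbehaves for two reasons. First, tag.string is None whenever a tag has more than one child, so a paragraph with mixed content such as <p>This is <b>some</b> text</p> does not pass the first check. Second, the for/else is inverted: the break (a child with text was found) falls through to tag.extract(), while the else: continue branch keeps the tags whose children have no text. The following is a minimal sketch of one possible alternative, not a definitive implementation. It assumes that "no text" means get_text(strip=True) returns an empty string and that the markup is parsed with html.parser. The helper names is_blank, strip_tags_with_no_text and strip_trailing_blank are made up for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def is_blank(node):
    """Return True for whitespace-only strings and for tags with no visible text."""
    if isinstance(node, NavigableString):
        return not node.strip()
    return not node.get_text(strip=True)


def strip_tags_with_no_text(soup, tagname):
    """Remove every <tagname> element that contains no non-whitespace text."""
    # find_all() returns a list, so extracting while looping over it is safe.
    for tag in soup.find_all(tagname):
        if is_blank(tag):
            # extract() is used instead of decompose() so that nested matches
            # further down the list remain valid objects.
            tag.extract()


def strip_trailing_blank(root):
    """Rough analogue of str.rstrip(): drop blank nodes from the end of root."""
    while root.contents and is_blank(root.contents[-1]):
        root.contents[-1].extract()


if __name__ == "__main__":
    html = "<p>This is <b>some</b> random text</p><p><br/>\n</p><p><br/> \n</p>\n"
    soup = BeautifulSoup(html, "html.parser")
    strip_trailing_blank(soup)
    print(soup)
```

For a full document with <html> and <body> wrappers, strip_trailing_blank would need soup.body as root, since the last top-level node is then the <html> tag itself. Note that under this definition a <p> holding only an <img> also counts as having no text, so the check may need tightening for such markup.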