Utility function to remove tags

Question

I want to use Beautiful Soup to extract both the text inside the tag and the text that follows it from this bit of HTML:

<p><i>Italic stuff</i> Not Italic stuff</p>

So I did:

soup = BeautifulSoup('<p><i>Italic stuff</i> Not Italic stuff</p>')
ital = soup.i.string
notital = soup.string

However, soup.string returns None rather than 'Not Italic stuff'. What am I doing wrong?

Thanks!

Answer 1

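From the documentation for the .string attribute:

If this tag has a single string child, the return value is that string. If this tag has no children, or more than one child, the return value is None. If this tag has one child tag, the return value is the 'string' attribute of the child tag, recursively.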
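The three cases in that description can be checked directly. This is a minimal sketch that assumes the bs4 package on Python 3 and passes "html.parser" explicitly; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# An explicit parser keeps the result independent of which optional parsers are installed.
soup = BeautifulSoup("<p><i>Italic stuff</i> Not Italic stuff</p>", "html.parser")

# <i> has exactly one string child, so .string is that string.
single_string = soup.i.string

# <p> has two children (the <i> tag and the trailing text), so .string is None.
# soup.string is None for the same reason: it recurses into <p>.
two_children = soup.p.string

# A <p> whose only child is a tag: .string recurses into that child.
nested = BeautifulSoup("<p><i>Italic stuff</i></p>", "html.parser").p.string

print(single_string)
print(two_children)
print(nested)
```

In the question's markup, the second case applies, which is why soup.string comes back as None.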

It looks like you need to extract the tail text of the i element, as shown in this answer:

In [12]: soup.i.findNextSibling(text=True)
Out[12]: u' Not Italic stuff'
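With the newer bs4 package the same tail text can be reached through sibling and child navigation. The sketch below assumes bs4 on Python 3; the names direct_text and all_text are hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p><i>Italic stuff</i> Not Italic stuff</p>", "html.parser")

ital = soup.i.string

# The node immediately after <i> inside <p> is the trailing text node.
notital = soup.i.next_sibling

# Alternative: keep only the strings that are direct children of <p>,
# skipping text nested inside child tags such as <i>.
direct_text = [child for child in soup.p.children
               if isinstance(child, NavigableString)]

# get_text() returns all text under <p>, including the text inside <i>.
all_text = soup.p.get_text()

print(ital)
print(notital)
print(all_text)
```

Note that the tail text keeps its leading space; call .strip() on it if that is unwanted.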

Answer 2

Utility function to remove tags:

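# Written for Python 2 (note the use of unicode). The original post did not
# show its imports; the BeautifulSoup 3 form is assumed here.
from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            # Rebuild the tag's contents as a string, recursing into child tags
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            # Replace the tag with its stripped contents
            tag.replaceWith(s)

    return soup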

Call the function with the list of tags to remove:

# Invalid tags which we want to remove from the content
invalid_tags = ['p', 'div', 'a', 'strong', 'img', 'span', 'br', 'h1', 'h2', 'h3', 'h5', 'h6', 'em']
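For readers on Python 3 with bs4, a similar effect can be sketched with Tag.unwrap(), which replaces a tag with its own children. This is not the answer's code: strip_tags_bs4 is a hypothetical name, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

def strip_tags_bs4(html, invalid_tags):
    """Remove the listed tags but keep their contents (hypothetical bs4 variant)."""
    soup = BeautifulSoup(html, "html.parser")

    # find_all(True) returns every tag up front, so unwrapping while looping is safe.
    for tag in soup.find_all(True):
        if tag.name in invalid_tags:
            tag.unwrap()  # replace the tag with its children

    return soup

cleaned = strip_tags_bs4("<p><i>Italic stuff</i> Not Italic stuff</p>", ["p"])
print(cleaned)
```

Because unwrap() moves the child nodes rather than re-serializing them, nested tags that are not in invalid_tags stay in the tree as real tags.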

Extracting the contents of a <p> tag when other tags are present
