BeautifulSoup: keep some of the text but strip the remaining tags

Time: 2020-09-02 19:28:23

Tags: python beautifulsoup

I'm working on a bot that scrapes data from a forum. This is the input I get:

<description><![CDATA[ <p>This is a test post with a few emotes <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/grin.png?v=9" title=":grin:" class="emoji" alt=":grin:"> <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/heart.png?v=9" title=":heart:" class="emoji" alt=":heart:"></p> ]]></description>

From this I want to get

This is a test post with a few emotes :grin: :heart:

How can I do this? I'd also like it to work when the emotes appear in the middle of the text.

1 answer:

Answer 0: (score: 2)

from bs4 import BeautifulSoup, CData

txt = '''<description><![CDATA[ <p>This is a test post with a few emotes <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/grin.png?v=9" title=":grin:" class="emoji" alt=":grin:"> <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/heart.png?v=9" title=":heart:" class="emoji" alt=":heart:"></p> ]]></description>'''

# load main soup:
soup = BeautifulSoup(txt, 'html.parser')

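# find the CDATA section inside <description> and parse it as a new soup
# (string= is the current name of the deprecated text= argument, Beautiful Soup 4.4.0+)
cdata = soup.find('description').find(string=lambda t: isinstance(t, CData))
soup2 = BeautifulSoup(cdata, 'html.parser')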

# replace <img> with their alt=...
for img in soup2.select('img'):
    img.replace_with(img['alt'])

# print text
print(soup2.p.text)

Prints:

This is a test post with a few emotes :grin: :heart:
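
The same replace_with approach should also cover the case from the question where an emote sits in the middle of the text. Here is a minimal sketch, assuming the HTML fragment has already been pulled out of the CDATA section. The helper name emotes_to_text and the sample string are made up for illustration and are not part of the original post:

```python
from bs4 import BeautifulSoup


def emotes_to_text(html_fragment):
    """Return the text of an HTML fragment with each <img> replaced by its alt text."""
    soup = BeautifulSoup(html_fragment, 'html.parser')
    for img in soup.find_all('img'):
        # .get() avoids a KeyError for images that have no alt attribute
        img.replace_with(img.get('alt', ''))
    # collapse runs of whitespace left behind by removed tags
    return ' '.join(soup.get_text().split())


# hypothetical sample with an emote between two pieces of text
sample = ('<p>Nice <img src="grin.png" class="emoji" alt=":grin:"> work, thanks '
          '<img src="heart.png" class="emoji" alt=":heart:"></p>')
result = emotes_to_text(sample)
print(result)  # Nice :grin: work, thanks :heart:
```

Because each <img> is swapped for a plain string in place, the surrounding text keeps its order, so it does not matter whether the emote is at the end or in the middle of the sentence.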