I'm working on a bot that scrapes data from a forum. This is a sample of the input it has to handle:
<description><![CDATA[ <p>This is a test post with a few emotes <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/grin.png?v=9" title=":grin:" class="emoji" alt=":grin:"> <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/heart.png?v=9" title=":heart:" class="emoji" alt=":heart:"></p> ]]></description>
From that, I want to get:
This is a test post with a few emotes :grin: :heart:
How can I do this? I'd also like it to work when the emotes appear in the middle of the text.
Answer 0 (score: 2)
from bs4 import BeautifulSoup, CData
txt = '''<description><![CDATA[ <p>This is a test post with a few emotes <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/grin.png?v=9" title=":grin:" class="emoji" alt=":grin:"> <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/heart.png?v=9" title=":heart:" class="emoji" alt=":heart:"></p> ]]></description>'''
# load main soup:
soup = BeautifulSoup(txt, 'html.parser')
# find CDATA inside <description>, make new soup
# (string= is the current name for the older text= argument)
soup2 = BeautifulSoup(soup.find('description').find(string=lambda t: isinstance(t, CData)), 'html.parser')
# replace <img> with their alt=...
for img in soup2.select('img'):
    img.replace_with(img['alt'])
# print text
print(soup2.p.text)
Prints:
This is a test post with a few emotes :grin: :heart:
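For the case where the emotes sit in the middle of the text, here is a minimal sketch using only the standard library, as an alternative to the BeautifulSoup approach above rather than a replacement for it. It assumes the `<description>` element is well-formed XML, so `xml.etree.ElementTree` can return the CDATA content as ordinary text. The names `AltTextExtractor` and `description_to_text` and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collect text nodes, substituting each <img> with its alt attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; images without alt add nothing.
        if tag == "img":
            self.parts.append(dict(attrs).get("alt") or "")

    def handle_data(self, data):
        self.parts.append(data)


def description_to_text(xml_fragment):
    """Return the plain text of a <description> element, emotes as :name:."""
    # Assumption: the fragment is well-formed XML, so the CDATA section
    # is exposed as the element's text.
    inner_html = ET.fromstring(xml_fragment).text or ""
    parser = AltTextExtractor()
    parser.feed(inner_html)
    parser.close()  # flush any buffered trailing text
    # Collapse runs of whitespace and trim the ends.
    return " ".join("".join(parser.parts).split())


# Hypothetical sample with emotes in the middle of the sentence.
sample = (
    '<description><![CDATA[ <p>I <img src="heart.png" alt=":heart:"> '
    'this forum <img src="grin.png" alt=":grin:"> a lot</p> ]]></description>'
)
print(description_to_text(sample))
```

Because text and `alt` values are collected in document order, an emote stays at its position in the sentence. Note that the last step collapses all whitespace, including line breaks between paragraphs, so adjust it if the post layout matters.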