BeautifulSoup: get the text from one tag while ignoring the text in another tag

Time: 2020-09-02 22:55:50

Tags: python beautifulsoup tags

I have some text that looks like this:

 <item>
        <title>What Music Do You Build Robots to?</title>
        <dc:creator><![CDATA[@TaranMayer TaranMayer ]]></dc:creator>
        <description><![CDATA[ <aside class="quote no-group" data-username="DanMantz" data-post="34" data-topic="84065" data-full="true">
<div class="title">
<div class="quote-controls"></div>
<img alt="" width="20" height="20" src="https://www.vexforum.com/user_avatar/www.vexforum.com/danmantz/40/2285_2.png" class="avatar"> DanMantz:</div>
<blockquote>
<p>Classic Rock and Motown. I didn’t even consider that there are other options… <img src="https://www.vexforum.com/images/emoji/apple/slight_smile.png?v=9" title=":slight_smile:" class="emoji" alt=":slight_smile:"></p>
</blockquote>
</aside>
<p>This implies that you do indeed build robots. May we see some of your creations?</p> ]]></description>
        <link>https://www.vexforum.com/t/what-music-do-you-build-robots-to/84065/35</link>
        <pubDate>Wed, 02 Sep 2020 17:24:19 +0000</pubDate>
        <guid isPermaLink="false">www.vexforum.com-post-669073</guid>
</item>

Using bs4, I want to get the text of everything inside the <description> tag except what is inside the <blockquote> tag. I want to get this:

This implies that you do indeed build robots. May we see some of your creations?

How can I do this? I searched for help but couldn't find what I needed.

1 answer:

Answer 0: (score: 1)

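To get the text you want, parse the CDATA content of <description> as its own soup and remove the unwanted tags with the .extract() method: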

from bs4 import BeautifulSoup, CData


txt = """<item>
        <title>What Music Do You Build Robots to?</title>
        <dc:creator><![CDATA[@TaranMayer TaranMayer ]]></dc:creator>
        <description><![CDATA[ <aside class="quote no-group" data-username="DanMantz" data-post="34" data-topic="84065" data-full="true">
<div class="title">
<div class="quote-controls"></div>
<img alt="" width="20" height="20" src="https://www.vexforum.com/user_avatar/www.vexforum.com/danmantz/40/2285_2.png" class="avatar"> DanMantz:</div>
<blockquote>
<p>Classic Rock and Motown. I didn’t even consider that there are other options… <img src="https://www.vexforum.com/images/emoji/apple/slight_smile.png?v=9" title=":slight_smile:" class="emoji" alt=":slight_smile:"></p>
</blockquote>
</aside>
<p>This implies that you do indeed build robots. May we see some of your creations?</p> ]]></description>
        <link>https://www.vexforum.com/t/what-music-do-you-build-robots-to/84065/35</link>
        <pubDate>Wed, 02 Sep 2020 17:24:19 +0000</pubDate>
        <guid isPermaLink="false">www.vexforum.com-post-669073</guid>
</item>"""

# load main soup:
soup = BeautifulSoup(txt, "html.parser")

# find the CData node inside <description>
# (string= is the current name of the older text= argument)
desc = soup.find("description").find_next(string=lambda t: isinstance(t, CData))
# parse the CDATA content as a new soup
desc = BeautifulSoup(desc, "html.parser")

# extract tags we don't want
for a in desc.select("aside"):
    a.extract()

# print the text:
print(desc.text.strip())

Prints:

This implies that you do indeed build robots. May we see some of your creations?
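
Note that the question asks to skip <blockquote>, while the answer removes the whole <aside>. The sketch below illustrates why, using a shortened copy of the description HTML. The helper name `text_without` is hypothetical, and it uses `.decompose()` (which destroys the tag) in place of `.extract()`, since the removed tags are not needed afterwards:

```python
from bs4 import BeautifulSoup


def text_without(html, selector):
    """Return the text of an HTML fragment after removing every
    element that matches the CSS selector (hypothetical helper)."""
    fragment = BeautifulSoup(html, "html.parser")
    for tag in fragment.select(selector):
        tag.decompose()  # remove the tag and everything inside it
    # join the remaining strings with single spaces, dropping blank ones
    return fragment.get_text(" ", strip=True)


# shortened copy of the CDATA content from the question
html = (
    '<aside class="quote"><div class="title">DanMantz:</div>'
    "<blockquote><p>Classic Rock and Motown.</p></blockquote></aside>"
    "<p>This implies that you do indeed build robots.</p>"
)

# removing only <blockquote> leaves the quoted user's name behind
print(text_without(html, "blockquote"))
# removing the whole <aside> leaves only the reply text
print(text_without(html, "aside"))
```

With the shortened input, the first call still contains "DanMantz:" from the quote header, which is why removing the enclosing <aside> matches the output the question asks for.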