Question

I am using BeautifulSoup 4 (bs4) to read an XML RSS feed and have come across the following entry. I am trying to read the content enclosed in the <content:encoded><![CDATA[...]]></content:encoded> tag:

<item>
    <title>Foobartitle</title>
    <link>http://www.acme.com/blah/blah.html</link>
    <category><![CDATA[mycategory]]></category>
    <description><![CDATA[The quick brown fox jumps over the lazy dog]]></description>
    <content:encoded>
        <![CDATA[<p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>]]>
    </content:encoded>
</item>

As far as I can tell, this format is part of the RSS content module and is quite common.

I want to isolate the <content:encoded> tag and then read its CDATA content. For the avoidance of doubt, the result should be <p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>.

I have looked at this, this and this Stack Overflow post, but since they are not directly related to my case, I cannot work out how to get this done.

I am using the lxml XML parser with bs4.

Any suggestions? Thanks!

Answer 1

from bs4 import BeautifulSoup

doc = ...
soup = BeautifulSoup(doc, "xml")  # Directs bs to use lxml

Interestingly, BeautifulSoup / lxml alters the tags, notably changing content:encoded to just encoded.

>>> print(soup)
<?xml version="1.0" encoding="utf-8"?>
<item>
<title>Foobartitle</title>
<link>http://www.acme.com/blah/blah.html</link>
<category>mycategory</category>
<description>The quick brown fox jumps over the lazy dog</description>
<encoded>
        &lt;p&gt;&lt;img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /&gt;&lt;/p&gt;
    </encoded>
</item>

From there, it should just be a matter of iterating over the children.

# Each <encoded> tag holds the former CDATA payload as a text child
for encoded_content in soup.find_all("encoded"):
    for child in encoded_content.children:
        print(child)

The result is <p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>. Note that this appears to be an instance of bs4.element.NavigableString, not CData as in the linked answers.
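
For comparison, here is a minimal sketch of the same extraction using only the standard library's xml.etree.ElementTree instead of bs4. It is an alternative technique, not what the answer above does. It assumes the full feed declares the content prefix on its root element with the RSS content module namespace URI (the <item> snippet in the question omits that declaration, so the <rss> and <channel> wrapper here is added for illustration); the names CONTENT_NS, feed and encoded_html are made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: the real feed binds the "content" prefix to this namespace URI
# on its root element; the snippet in the question does not show it.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Illustrative feed: the question's <item>, wrapped in a root that declares
# the namespace so that a strict XML parser accepts the prefix.
feed = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
    <title>Foobartitle</title>
    <link>http://www.acme.com/blah/blah.html</link>
    <content:encoded>
        <![CDATA[<p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>]]>
    </content:encoded>
</item>
</channel>
</rss>"""

root = ET.fromstring(feed)

# ElementTree addresses namespaced tags as "{namespace-uri}localname".
# The CDATA payload arrives as plain element text; strip the surrounding
# indentation and newlines.
encoded_html = [
    (node.text or "").strip()
    for node in root.iter("{%s}encoded" % CONTENT_NS)
]

for html in encoded_html:
    print(html)
```

As with the bs4 approach, the CDATA section is not exposed as a separate node type here; its content is simply the text of the element.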
