Question

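I'm editing the original post here to clarify, and hopefully I've boiled it down to something more manageable. I have a string of XML that looks like this: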

<foo id="foo">
    <row>
        &lt;img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764"&gt;
    </row>
    <row>
        &lt;img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225"&gt;
    </row>
</foo>

So I'm doing something like this:

xml = BeautifulStoneSoup(someXml, selfClosingTags=['img'], convertEntities=BeautifulSoup.HTML_ENTITIES)

which results in:

<foo id="foo">
    <row>
        <img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764">
    </row>
    <row>
        <img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225">
    </row>
</foo>

Note that neither img tag has a closing tag. I'm not sure that is my problem, but it might be. When I try:

images = xml.findAll('img')

it yields an empty list. Any ideas why BeautifulStoneSoup doesn't find my images in this XML fragment?

Answer 1

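The reason you're not finding the img tags is that BeautifulSoup treats them as part of the text of the "row" tags. Converting the entities only changes the strings; it doesn't change the underlying structure of the document. The following isn't a great solution (it parses the document twice), but it worked when I tested it on your sample XML. The idea is to turn the text into bad XML and then have Beautiful Soup clean it up again.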

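from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2
# Inner parse decodes &lt;/&gt; inside the row text; prettify() re-serializes it; the outer parse then sees real <img> tags.
soup = BeautifulSoup(BeautifulSoup(text, convertEntities=BeautifulSoup.HTML_ENTITIES).prettify())
print soup.findAll('img')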
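
For anyone who wants to see the same "parse twice" idea without BeautifulSoup, here is a rough sketch using only the Python 3 standard library. This is a swapped-in technique, not what I ran above: `xml.etree.ElementTree` does the first parse (it decodes the entities, so each row's text becomes a literal `<img ...>` string), and `html.parser.HTMLParser` does the second parse on that text. The names `SAMPLE_XML`, `ImgCollector` and `find_escaped_images` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The sample document from the question, with the img markup escaped.
SAMPLE_XML = """<foo id="foo">
    <row>
        &lt;img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764"&gt;
    </row>
    <row>
        &lt;img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225"&gt;
    </row>
</foo>"""


class ImgCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def find_escaped_images(xml_text):
    """Parse the XML, then parse each row's decoded text as HTML."""
    root = ET.fromstring(xml_text)
    collector = ImgCollector()
    for row in root.iter("row"):
        # After the first parse the escaped markup is plain text on the row,
        # not a child element, so it needs a second parse to become a tag.
        collector.feed(row.text or "")
    collector.close()
    return collector.images


root = ET.fromstring(SAMPLE_XML)
no_img_elements = root.findall(".//img")   # the first parse finds no img elements
images = find_escaped_images(SAMPLE_XML)   # the second parse finds them
print(len(no_img_elements), [img["src"] for img in images])
```

The first lookup comes back empty for the same reason `findAll('img')` does in the question: after one parse the images are still just text inside `row`.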
