Question

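I have a fairly complex template script that BeautifulSoup4 fails to understand for some reason. As you can see below, BS4 parses only part of the way into the tree before giving up. Why is this, and is there a way to fix it?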

>>> from bs4 import BeautifulSoup
>>> html = """<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script> Other stuff I want to stay"""
>>> soup = BeautifulSoup(html)
>>> soup.findAll('script')
[<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</script>]

Edit: On further testing, it seems that BS3 is able to parse it correctly for some reason:

>>> from BeautifulSoup import BeautifulSoup as bs3
>>> soup = bs3(html)
>>> soup.script
<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script>

Answer 1

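Beautiful Soup sometimes fails because of its default parser. Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers.

In some cases I have had to switch the parser to lxml, html5lib, or another one.

Here is an example of the above: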

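from bs4 import BeautifulSoup

# markup is the HTML string to parse (the html variable from the question).
# Passing "lxml" requires the lxml package to be installed.
soup = BeautifulSoup(markup, "lxml")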

I suggest reading http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
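To see whether switching parsers actually helps for a given document, it may be useful to run the same markup through each parser and compare what comes back for the script tag. The following is only a rough sketch: the helper name script_by_parser and the shortened html sample are made up for illustration, and it assumes that any parser which is not installed should simply be skipped (bs4 raises FeatureNotFound in that case).

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened version of the markup from the question (illustrative only).
html = (
    '<script id="scriptname" type="text/template">'
    '<section class="sectionname"><header><h1>Test</h1></header>'
    '<table><tr><th>Title</th><td class="checksum"></td></tr></table>'
    '</section></script> Other stuff I want to stay'
)


def script_by_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Hypothetical helper: render the first <script> tag under each parser.

    Returns a dict mapping parser name to the rendered tag, to an empty
    string if no <script> tag was found, or to None if the parser is not
    installed.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The requested parser library is not available.
            results[name] = None
            continue
        tag = soup.find("script")
        results[name] = str(tag) if tag is not None else ""
    return results


results = script_by_parser(html)
for name, rendered in results.items():
    if rendered is None:
        print(f"{name}: not installed")
    else:
        print(f"{name}: {len(rendered)} characters in <script>")
```

If one parser returns a much shorter script tag than the others, that is a hint it stopped early inside the template, and one of the other parsers is probably the better choice for that document.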

BeautifulSoup fails to parse script text/template correctly
