Question

I am trying to parse an HTML document with BeautifulSoup in Python.

But it stops parsing at special characters, for example:

from bs4 import BeautifulSoup
doc = '''
<html>
    <body>
        <div>And I said «What the %&#@???»</div>
        <div>some other text</div>
    </body>
</html>'''
soup = BeautifulSoup(doc,  'html.parser')
print(soup)

This code should output the whole document. Instead, it prints only

<html>
<body>
<div>And I said «What the %</div></body></html>

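The rest of the document is apparently lost. Parsing was stopped by the combination '&#'.

The question is: how can I set up BS or preprocess the document to avoid such problems while losing as little (possibly informative) text as possible?

I am using bs4 version 4.6.0 with Python 3.6.1 on Windows 10.

Update: soup.prettify() does not help, because soup is already broken at that point.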

Answer 1

You need to use "html5lib" as the parser instead of "html.parser" in your BeautifulSoup object. For example:

from bs4 import BeautifulSoup
doc = '''
<html>
    <body>
        <div>And I said «What the %&#@???»</div>
        <div>some other text</div>
    </body>
</html>'''

soup = BeautifulSoup(doc,  'html5lib')
#          different parser  ^

Now, if you print soup, it will display the string you want:

>>> print(soup)
<html><head></head><body>
        <div>And I said «What the %&amp;#@???»</div>
        <div>some other text</div>

</body></html>
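
Note that html5lib is a separate package from bs4, and BeautifulSoup raises bs4.FeatureNotFound when asked for a parser that is not installed. A minimal sketch of a fallback check, assuming a hypothetical helper named choose_parser (it is not part of bs4):

```python
import importlib.util


def choose_parser(preferred="html5lib", fallback="html.parser"):
    """Return `preferred` if a package of that name is importable, else `fallback`.

    Hypothetical helper for illustration; the default names are the
    parser strings used in this thread.
    """
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


if __name__ == "__main__":
    # Prints "html5lib" when that package is installed, otherwise "html.parser".
    print(choose_parser())
```

The result can be passed as the second argument, e.g. BeautifulSoup(doc, choose_parser()). Keep in mind that falling back to html.parser brings back the truncation shown in the question, so this only makes the missing dependency visible; it does not fix the parsing by itself.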

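From the "Differences between parsers" section of the documentation:

Unlike html5lib, this parser (html.parser) makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.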
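
If installing html5lib is not an option, the preprocessing route asked about in the question could look roughly like the sketch below. It escapes every "&" that does not begin a complete, semicolon-terminated character reference, so the stray "&#" becomes "&amp;#" before the parser sees it. The function name escape_stray_ampersands and the regular expression are assumptions made for illustration, not something bs4 provides:

```python
import re

# An "&" that is NOT followed by a named, decimal, or hexadecimal
# character reference ending in ";".
_STRAY_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_stray_ampersands(html):
    """Replace stray "&" characters with "&amp;", leaving real references alone."""
    return _STRAY_AMP.sub("&amp;", html)


if __name__ == "__main__":
    line = "<div>And I said «What the %&#@???»</div>"
    print(escape_stray_ampersands(line))
```

The cleaned string can then be handed to BeautifulSoup(cleaned, 'html.parser'). This is a heuristic, not a full HTML tokenizer: it does not check that a named reference actually exists, it also rewrites legacy references written without a trailing semicolon (such as &copy), and it touches "&" inside scripts and attribute values as well.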

Missing special characters and tags when parsing HTML with BeautifulSoup
