Question

I want to parse part of an HTML page with BeautifulSoup.

Here is my code:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

body = """Some text
<body{block:PermalinkPage} class="inside"{/block:PermalinkPage}>
Some text
"""

print(BeautifulSoup(body, 'html5lib'))

Output:

<html><head></head><body>Some text
<body{block:permalinkpage} block:permalinkpage}="" class="inside" {="">
Some text
</body{block:permalinkpage}></body></html>

The desired output is:

<html><head></head><body>Some text
<body{block:PermalinkPage} class="inside"{/block:PermalinkPage}>
Some text
</body{block:permalinkpage}></body></html>

Why does BeautifulSoup change the markup like this? Is it possible to force it to behave the way I expect? Which library should I use to get the desired output?

Answer 1

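This does not look like valid HTML (though I could be wrong). BeautifulSoup delegates the actual parsing to an underlying parser, which in this case you explicitly set to html5lib. If the underlying parser cannot handle your input, bs4 cannot either. html5lib follows the HTML5 parsing rules, so it lowercases tag and attribute names and treats the stray pieces of {/block:PermalinkPage} as extra attributes, which is what your output shows.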
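To see what the parser layer makes of this tag, here is a minimal sketch using only the standard library's html.parser, which is one of the parsers bs4 can delegate to. It assumes Python 3, and the TagLogger class is a hypothetical helper written for this illustration, not part of bs4. It records every start tag the tokenizer reports for the input from the question.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records each start tag the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # tag arrives already lowercased; attrs is a list of (name, value) pairs
        self.start_tags.append((tag, attrs))


body = """Some text
<body{block:PermalinkPage} class="inside"{/block:PermalinkPage}>
Some text
"""

logger = TagLogger()
logger.feed(body)
logger.close()

# Print what the parser believes the tag name and attributes are
for tag, attrs in logger.start_tags:
    print(tag, attrs)
```

The tag name comes back as body{block:permalinkpage}, lowercased and with the template marker glued on, so the original casing and the marker boundaries are already lost before BeautifulSoup builds its tree.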

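It looks like you are feeding it a template language that is meant to be processed into HTML (for example mustache or slim), but without more context it is hard to say.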
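If the input really is a template, one possible workaround is to remove the template markers before handing the text to an HTML parser. The sketch below assumes the markers always have the form {block:Name} or {/block:Name}, as in the question; the pattern and the strip_block_markers helper are illustrative assumptions, not part of any library. Note that this discards the conditional blocks, so it does not produce the exact output asked for; it only yields HTML that a parser can handle predictably.

```python
import re

# Assumed marker syntax, based only on the snippet in the question:
# {block:Name} opens a block and {/block:Name} closes it.
BLOCK_MARKER = re.compile(r"\{/?block:[A-Za-z0-9_]+\}")


def strip_block_markers(source):
    """Return source with every {block:...} and {/block:...} marker removed."""
    return BLOCK_MARKER.sub("", source)


body = """Some text
<body{block:PermalinkPage} class="inside"{/block:PermalinkPage}>
Some text
"""

cleaned = strip_block_markers(body)
print(cleaned)
```

After this step the tag reads <body class="inside">, which html5lib or any other parser should accept as an ordinary body tag.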

BeautifulSoup: parsing part of a page (tumblr template), unexpected results