Question

lxml seems to add a default doctype when the HTML document does not declare one.

See this demo code:

import lxml.etree
import lxml.html


def beautify(html):
    parser = lxml.etree.HTMLParser(
        strip_cdata=True,
        remove_blank_text=True
    )

    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo

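    # tostring() returns bytes when an encoding is given (on Python 3),
    # so decode the result to allow the str comparisons below.
    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=docinfo.doctype,
        encoding='utf8'
    ).decode('utf8')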


with_doctype = """
<!DOCTYPE html>
<html>
<head>
  <title>With Doctype</title>
</head>
</html>
"""

# This passes!
assert "DOCTYPE" in beautify(with_doctype)

no_doctype = """<html>
<head>
  <title>No Doctype</title>
</head>
</html>"""

# This fails!
assert "DOCTYPE" not in beautify(no_doctype)

# because the returned html contains this line
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# which was not present in the source before

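How do I tell lxml not to do this?

This question was originally raised here: https://github.com/mitmproxy/mitmproxy/issues/845

Quoting a comment on reddit that may help:

lxml is based on libxml2, which does this by default unless you pass the option HTML_PARSE_NODEFDTD, I believe. Code here.

I don't know whether you can tell lxml to pass that option. libxml has Python bindings you could use directly, but they look really hairy.

Edit: I did some digging, and that option does appear in the lxml source here. The option does exactly what you want, but I'm not sure how to activate it, if that is even possible.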

Answer 1

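This is currently not possible in lxml, but I created a Pull Request on lxml that adds a default_doctype boolean option to HTMLParser.

Once the code has been merged, the parser needs to be created like this: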

parser = lxml.etree.HTMLParser(
    strip_cdata=True,
    remove_blank_text=True,
    default_doctype=False,
)

Everything else stays the same.
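
For illustration, here is a minimal sketch of how the demo from the question might look with that option. It assumes an lxml release that already includes the default_doctype parameter. The `docinfo.doctype or None` guard and the switch to encoding="unicode" (so that tostring() returns str rather than bytes) are additions made for this sketch and are not part of the pull request.

```python
import lxml.etree
import lxml.html


def beautify(html):
    # Assumption: the installed lxml supports default_doctype on HTMLParser.
    parser = lxml.etree.HTMLParser(
        remove_blank_text=True,
        default_doctype=False,
    )
    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo

    return lxml.etree.tostring(
        d,
        pretty_print=True,
        # Pass a doctype only if the source actually declared one.
        doctype=docinfo.doctype or None,
        # "unicode" makes tostring() return str instead of bytes.
        encoding="unicode",
    )


with_doctype = "<!DOCTYPE html>\n<html><head><title>With Doctype</title></head></html>"
no_doctype = "<html><head><title>No Doctype</title></head></html>"

# Both checks from the question are expected to hold with the option set.
assert "DOCTYPE" in beautify(with_doctype)
assert "DOCTYPE" not in beautify(no_doctype)
```

On an lxml version without the option, constructing the parser should fail with a TypeError about an unexpected keyword argument, so the sketch will not silently fall back to the old behaviour.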
