Question

What is the best technique for finding out whether a string contains valid HTML with correct syntax?

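I tried HTMLParser from the html.parser module, on the assumption that if parsing produced no errors, the string was valid HTML. That did not help, because it raises no error even when parsing an invalid string.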

from html.parser import HTMLParser

parser = HTMLParser()

parser.feed('<h1> hi')
parser.close()

I expected it to raise some exception or error because the closing tag is missing, but it did not.

Answer 1

    >>> from bs4 import BeautifulSoup
    >>> st = """<html>
    ... <head><title>I'm title</title></head>
    ... </html>"""
    >>> st1 = "who are you"
    >>> bool(BeautifulSoup(st, "html.parser").find())
    True
    >>> bool(BeautifulSoup(st1, "html.parser").find())
    False

Answer 2

The standard HTMLParser from html.parser does not validate HTML markup for errors; it only tokenizes everything in the string.
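To make the tokenizing behaviour concrete, here is a minimal sketch that logs the events the parser reports for the string from the question. The `EventLogger` class and its `events` list are hypothetical names introduced for illustration; only `HTMLParser` and its `handle_*` hooks come from the standard library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each token the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<h1> hi")
logger.close()  # flushes buffered text; raises nothing for the open <h1>
print(logger.events)
```

The log contains a start event for `h1` and a data event for the text, but no end event, and no exception is raised at any point.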

You may want to look at py_w3c. The module does not appear to be maintained, but it does find the errors:

from py_w3c.validators.html.validator import HTMLValidator


val = HTMLValidator()
val.validate_fragment("<h1> hey yo")

for error in val.errors:
    print(error.get("message"))

$ python3.7 html-parser.py
Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
Element “head” is missing a required instance of child element “title”.
End of file seen and there were open elements.
Unclosed element “h1”.
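If an unmaintained dependency is a concern, a rough standard-library alternative is to track open tags on a stack on top of HTMLParser. The sketch below is an assumption-laden illustration, not a validator: `TagBalanceChecker`, `find_tag_errors`, and the `VOID_ELEMENTS` set are hypothetical names, and the check is stricter than HTML itself, which allows some end tags (for example `</p>` and `</li>`) to be omitted.

```python
from html.parser import HTMLParser

# Void elements never take an end tag in HTML.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: reports end tags that do not match the open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of elements still awaiting an end tag
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(f"Unexpected end tag </{tag}>")


def find_tag_errors(html):
    """Return a list of tag-balance problems; an empty list means none were found."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    unclosed = [f"Unclosed element <{t}>" for t in reversed(checker.open_tags)]
    return checker.errors + unclosed


print(find_tag_errors("<h1> hi"))
print(find_tag_errors("<h1>hi</h1>"))
```

This only checks that tags are balanced. It knows nothing about doctypes, required children, or attribute rules, so a full validator is still needed for those.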

Verify whether a string is valid HTML in Python?
