Question

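With the function prettify() I can print well-formatted HTML, and I have read that it also prints broken HTML correctly (for example, if a tag is opened but never closed, prettify() helps fix this). But is it only this function that does so, or is the markup already repaired once the data is loaded into a Beautiful Soup object, as in:

soup = BeautifulSoup(data)

so that soup now contains code that has been fixed? For example, if I have this broken code:

<body>
 <p><b>Paragraph.</p>
</body>

and I load it into a BS object, is it also seen inside the soup object as fixed?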

Answer 1

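The HTML markup is corrected when the soup is created, not when it is pretty-printed. This is required so that BeautifulSoup can navigate the document properly.

As shown below, the string representation of the soup contains the corrected markup: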

>>> from bs4 import BeautifulSoup
>>> text="""<body>
...  <p><b>Paragraph.</p>
... </body>
... """
>>> soup = BeautifulSoup(text)
>>> str(soup)
'<body>\n<p><b>Paragraph.</b></p>\n</body>\n'
>>>
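Here is the same check as a small script, as a rough sketch. It assumes bs4 is installed and names the built-in "html.parser" explicitly (my choice, not something from the question); other parsers such as lxml or html5lib may repair the markup a little differently. It shows that the <b> tag is already closed in the tree before prettify() is ever called:

```python
from bs4 import BeautifulSoup

broken = "<body>\n <p><b>Paragraph.</p>\n</body>\n"

# Assumption: use the standard-library parser; the repair happens here, at parse time.
soup = BeautifulSoup(broken, "html.parser")

# The tree can be navigated as if <b> had been closed properly.
bold = soup.find("b")
print(bold.get_text())    # text inside the repaired <b> element
print(bold.parent.name)   # <b> sits inside <p>

# Both serializations contain the closing tag; prettify() only changes the layout.
print("</b>" in str(soup))
print("</b>" in soup.prettify())
```

So prettify() is not needed to get fixed markup; str(soup) or any navigation on soup already works on the repaired tree.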

If you read the source of class BeautifulStoneSoup, you will find the following comment, which addresses your broken markup:

    This class contains the basic parser and search code. It defines
    a parser that knows nothing about tag behavior except for the
    following:

      You can't close a tag without closing all the tags it encloses.
      That is, "<foo><bar></foo>" actually means
      "<foo><bar></bar></foo>".
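To make the rule in that comment concrete, here is a minimal sketch of the idea using only the standard library. This is not Beautiful Soup's actual code: ImplicitCloser and repair are hypothetical names made up for illustration, and attributes are dropped to keep it short. It keeps a stack of open tags, and when a closing tag arrives it first closes everything that was opened inside it:

```python
from html import escape
from html.parser import HTMLParser


class ImplicitCloser(HTMLParser):
    """Hypothetical helper: re-emits markup, closing enclosed tags implicitly."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of currently open tag names
        self.out = []        # pieces of the repaired markup

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch.
        self.open_tags.append(tag)
        self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag with no matching open tag: skip it
        # Close every tag opened after `tag`, then `tag` itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Re-escape text, since character references were decoded by the parser.
        self.out.append(escape(data, quote=False))


def repair(markup):
    parser = ImplicitCloser()
    parser.feed(markup)
    parser.close()
    # Close anything still open at the end of the input.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(repair("<foo><bar></foo>"))  # prints <foo><bar></bar></foo>
print(repair("<body>\n <p><b>Paragraph.</p>\n</body>\n"))
```

The point is only that the closing happens while the input is being parsed, which is why the soup object already holds the fixed structure.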

Then, further down the source, you can see that BeautifulSoup inherits from BeautifulStoneSoup.

Tidying up code with Beautiful Soup
