Question

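We are converting HTML-based documents into book form. The input HTML usually contains many line breaks and indented lines so that it is human-readable in a plain text editor. The indentation consists mainly of spaces, which browsers normally ignore. For example: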

    <p>
        This is a text with two lines<br>
        and this is the second line.
    </p>

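When this is rendered in a browser, the spaces in front of both lines and the line breaks after them are ignored completely, and the text is displayed as if the HTML code looked like this: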

<p>This is a text with two lines<br>and this is the second line.</p>
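To make the whitespace behaviour concrete, here is a minimal sketch of the first step a browser takes with normal text: every run of spaces, tabs and line breaks is collapsed into a single space. The function name `collapse_whitespace` is hypothetical, and the sketch deliberately does not model the second step, in which the browser also drops the remaining spaces at the start and end of a line inside a block element.

```python
import re

def collapse_whitespace(text):
    """Collapse each run of whitespace into one space (rough model only)."""
    return re.sub(r'\s+', ' ', text)

source = ("    <p>\n"
          "        This is a text with two lines<br>\n"
          "        and this is the second line.\n"
          "    </p>")

collapsed = collapse_whitespace(source)
print(repr(collapsed))
# ' <p> This is a text with two lines<br> and this is the second line. </p>'
```

The single spaces that are left next to the tags are exactly the ones that still have to be removed to reach the compact form shown above.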

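I need a function in Python that parses the first HTML snippet and outputs the second one, without the "pretty print" whitespace. The best solution would also produce XHTML so that it can be parsed with ElementTree.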

I have heard that BeautifulSoup can do this, but it does not seem to work as expected. Here are some examples:

from bs4 import BeautifulSoup
input = """    <p>
        This is a text with two lines<br>
        and this is the second line.
    </p>"""
soup = BeautifulSoup(input, 'html.parser')
print unicode(soup)

This prints the following string:

u' <p>\n        This is a text with two lines<br/>\n        and this is the second line.\n    </p>'

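As you can see, there is still a space in front of the <p>, and the line breaks and indentation are still there. You get similar output with the lxml parser: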

u'<html><body><p>\n        This is a text with two lines<br/>\n        and this is the second line.\n    </p></body></html>'

Then there are the formatters available for the prettify method. With no formatter at all, the result comes close to what I expect.

soup.prettify(formatter=None)

Result:

u'<p>\n This is a text with two lines\n <br>\n and this is the second line.\n</p>'

But the line breaks are still there, and now there is also a line break before <br>, which makes no sense to me.

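Even if I walked through all the text and replaced the line breaks with nothing, there would still be whitespace where there should be none. Is there a better library, or something I have not found, that helps me produce the following result?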

<p>This is a text with two lines<br>and this is the second line.</p>

Answer 1

Once you have the result formatted consistently by BeautifulSoup, prettify, or some other method, you can perform the substitution with a regular expression using re.sub().

import re

s = "<p>\n This is a text with two lines\n <br>\n and this is the second line.\n</p>"
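# Drop each line break together with any indentation that follows it,
# including the bare "\n" directly in front of </p>.
replaced = re.sub(r'\n\s*', '', s)

print(replaced)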
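One caveat with deleting line breaks outright: if a sentence is wrapped across several lines in the source, the words on either side of the break get glued together. The following is a minimal sketch of a variant that avoids this, assuming the markup contains no `<pre>` blocks or other whitespace-sensitive content. The name `join_lines` is hypothetical. Line breaks that touch a tag are removed, while line breaks inside running text become a single space.

```python
import re

def join_lines(html):
    """Remove line breaks next to tags; turn the others into one space."""
    # Line breaks (plus surrounding indentation) directly after '>' or before '<'.
    html = re.sub(r'(?<=>)\s*\n\s*|\s*\n\s*(?=<)', '', html)
    # Any remaining line break sits inside running text: keep one space.
    return re.sub(r'\s*\n\s*', ' ', html)

s = ("<p>\n This is a text\n with two lines\n <br>\n"
     " and this is the second line.\n</p>")
print(join_lines(s))
# <p>This is a text with two lines<br>and this is the second line.</p>
```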

Answer 2

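Assuming the HTML is well-formed and contains no < or > characters that are not part of the document structure (for example in comments or JavaScript blocks), you can use this regular expression substitution to remove all whitespace immediately before and after every HTML tag: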

import re

input = """    <p>
        This is a text with two lines<br>
        and this is the second line.
    </p>"""

print(re.sub(r'\s*(<.*?>)\s*', r'\1', input))
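The question also asks for XHTML that ElementTree can parse. As a rough sketch under the same assumptions as above, the substitution can be wrapped in a function and followed by a second pass that self-closes void elements such as `<br>`. The helper names and the short `VOID_TAGS` list are hypothetical choices for illustration, and the sketch only handles lowercase tag names.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a deliberately short list; extend it for your documents.
VOID_TAGS = ("br", "hr", "img")

def strip_tag_whitespace(html):
    """Remove whitespace directly before and after every tag."""
    return re.sub(r'\s*(<.*?>)\s*', r'\1', html)

def close_void_tags(html):
    """Rewrite <br> or <br /> style void tags as <br/> so XML parsers accept them."""
    pattern = r'<(%s)\b([^<>]*?)\s*/?>' % "|".join(VOID_TAGS)
    return re.sub(pattern, r'<\1\2/>', html)

source = """    <p>
        This is a text with two lines<br>
        and this is the second line.
    </p>"""

compact = strip_tag_whitespace(source)
xhtml = close_void_tags(compact)
root = ET.fromstring(xhtml)

print(compact)  # <p>This is a text with two lines<br>and this is the second line.</p>
print(xhtml)    # <p>This is a text with two lines<br/>and this is the second line.</p>
print(root.tag, repr(root.text), repr(root[0].tail))
```

Be aware that this pattern also removes meaningful spaces between inline elements: `<b>a</b> <i>b</i>` becomes `<b>a</b><i>b</i>`, so the two words run together when rendered.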

Answer 3

Try using this entity (a non-breaking space) where you want whitespace to be kept.

&nbsp;

https://www.w3schools.com/html/html_entities.asp
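As a small illustration of how this interacts with the regular expression from Answer 2: the literal entity `&nbsp;` is ordinary text as far as `\s` is concerned, so it survives the stripping. This is only a sketch, and `strip_tag_whitespace` is a hypothetical wrapper name.

```python
import html
import re

def strip_tag_whitespace(markup):
    """Remove whitespace directly before and after every tag."""
    return re.sub(r'\s*(<.*?>)\s*', r'\1', markup)

source = "<p>\n    &nbsp;&nbsp;indented text<br>\n</p>"
result = strip_tag_whitespace(source)
print(result)  # <p>&nbsp;&nbsp;indented text<br></p>

# The entity decodes to U+00A0, the non-breaking space character.
print(repr(html.unescape("&nbsp;")))  # '\xa0'
```

Note that this only holds while the entity stays encoded. In Python 3, `\s` matches the decoded character U+00A0, so a parser that has already unescaped the entity would expose it to the same stripping.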

Which whitespace in pretty-printed HTML code can be ignored?
