Stop BeautifulSoup from removing whitespace

Asked: 2014-04-10 23:05:54

Tags: python beautifulsoup

BeautifulSoup is removing the spaces that come before a newline:

print BeautifulSoup("<?xml version='1.0' encoding='UTF-8'?><section>    \n</section>")

The code above prints:

<?xml version="1.0" encoding="utf-8"?>
<section>
</section>

Note that the four spaces after the opening section tag are gone! Interestingly, if I do this:

print BeautifulSoup("<?xml version='1.0' encoding='UTF-8'?><section>a    \n</section>")

I get:

<?xml version="1.0" encoding="utf-8"?>
<section>a    
</section>

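The four spaces after the 'a' are now preserved! How can I get the four spaces to show up in the output of the original print statement?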

1 Answer:

Answer 0 (score: 0)

As a workaround, you could try replacing every <section>...</section> with <pre>...</pre> before parsing. BeautifulSoup should then preserve the whitespace exactly. For example:

from bs4 import BeautifulSoup
import re

html = "<?xml version='1.0' encoding='UTF-8'?><section>    \n</section>"
html = re.sub(r'(\</?)(section)(\>)', r'\1pre\3', html)
soup = BeautifulSoup(html, "lxml")

print repr(soup.pre.text)    # repr used to show where the spaces are

Which gives you:

u'    \n'
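
The substitution step can also be checked on its own, before BeautifulSoup is involved. Below is a minimal sketch in Python 3 that uses only the standard re module. The helper name section_to_pre is made up for illustration, and like the pattern in the answer it only matches tags that carry no attributes:

```python
import re


# Hypothetical helper: rename <section> / </section> tags to <pre> / </pre>.
# Assumption: the tags have no attributes, as in the question's input.
def section_to_pre(markup):
    return re.sub(r'(</?)section(>)', r'\1pre\2', markup)


html = "<?xml version='1.0' encoding='UTF-8'?><section>    \n</section>"
converted = section_to_pre(html)

# repr() makes the spaces and the newline visible
print(repr(converted))
```

The four spaces and the newline are still present between the renamed tags, so whatever whitespace the parser keeps inside pre comes straight from the original input. If the rest of the program expects section elements, the tags would need to be renamed back after parsing.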