Question

I'm using BeautifulSoup to parse XML:

In [64]: b = bs4.BeautifulSoup('<xml><t xml:space="preserve">     </t><t xml:space="preserve">  A  </t></xml>', 'xml')
In [65]: b.find_all('t')
Out[65]: [<t xml:space="preserve"> </t>, <t xml:space="preserve">  A  </t>]

So the five spaces in the first t element are collapsed to one, even though the element has the xml:space="preserve" attribute.

Is there a way to make BeautifulSoup honor xml:space="preserve" and not collapse the whitespace?

Answer 1

I can't give a direct answer for BeautifulSoup itself. However, lxml can do this for you.

>>> from lxml import etree
>>> tree = etree.fromstring('<xml><t xml:space="preserve">     </t><t xml:space="preserve">  A  </t></xml>')
>>> [_.text for _ in tree.findall('t')]
['     ', '  A  ']
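
If installing lxml is not an option, the standard library's xml.etree.ElementTree exposes the same fromstring/findall API and also keeps the text of each element verbatim. The following is a minimal sketch of that alternative, not part of the original answer; the names XML_SPACE, doc, texts and flags are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is predefined and bound to this namespace URI,
# so xml:space shows up under this qualified attribute name.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

doc = '<xml><t xml:space="preserve">     </t><t xml:space="preserve">  A  </t></xml>'
root = ET.fromstring(doc)

# Collect the raw text of every <t> element; whitespace is returned unchanged.
texts = [t.text for t in root.findall("t")]
print(texts)

# Read the xml:space attribute in case the caller wants to branch on it.
flags = [t.get(XML_SPACE) for t in root.findall("t")]
print(flags)
```

Note that the parser here does not act on xml:space at all: it simply never collapses whitespace in element text, so the attribute is only informational for the calling code.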

Make BeautifulSoup honor xml:space="preserve"
