How do I get BeautifulSoup 4 to respect a self-closing tag?

Asked: 2013-02-19 15:47:05

Tags: python xml xml-parsing beautifulsoup

This question is specific to BeautifulSoup4, which makes it different from these earlier questions:

Why is BeautifulSoup modifying my self-closing elements?

selfClosingTags in BeautifulSoup

Since BeautifulStoneSoup (the previous XML parser) is gone, how can I get bs4 to respect a new self-closing tag? For example:

import bs4
S = '''<foo> <bar a="3"/> </foo>'''
soup = bs4.BeautifulSoup(S, selfClosingTags=['bar'])

print(soup.prettify())

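This does not self-close the bar tag, but it does print a warning that hints at the cause. What is the tree builder that bs4 is referring to, and how do I get the tag to self-close?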

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:112: UserWarning: BS4 does not respect the selfClosingTags argument to the BeautifulSoup constructor. The tree builder is responsible for understanding self-closing tags.
  "BS4 does not respect the selfClosingTags argument to the "
<html>
 <body>
  <foo>
   <bar a="3">
   </bar>
  </foo>
 </body>
</html>

1 answer:

Answer 0 (score: 12)

To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor.

soup = bs4.BeautifulSoup(S, 'xml')

You’ll need to have lxml installed.

You no longer need to pass in selfClosingTags:

In [1]: import bs4
In [2]: S = '''<foo> <bar a="3"/> </foo>'''
In [3]: soup = bs4.BeautifulSoup(S, 'xml')
In [4]: print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<foo>
 <bar a="3"/>
</foo>
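
Since the "xml" tree builder depends on lxml, a script can check for it before relying on self-closing output. The following is only a minimal sketch of that idea: the helper name prettify_xml and its "return None when something is missing" behavior are assumptions made for illustration, not part of bs4. It relies on bs4.FeatureNotFound, the exception bs4 raises when the requested parser is not installed.

```python
# Guard against bs4 itself being absent so the script fails softly.
try:
    import bs4
except ImportError:
    bs4 = None

S = '''<foo> <bar a="3"/> </foo>'''


def prettify_xml(markup):
    """Hypothetical helper: prettify markup with the XML tree builder.

    Returns None if bs4 or the lxml-backed "xml" parser is unavailable.
    """
    if bs4 is None:
        return None
    try:
        # "xml" selects the XML tree builder, which needs lxml.
        soup = bs4.BeautifulSoup(markup, "xml")
    except bs4.FeatureNotFound:
        # Raised when no installed parser provides the "xml" feature.
        return None
    return soup.prettify()


result = prettify_xml(S)
if result is None:
    print("bs4 and lxml are both required for the 'xml' parser")
else:
    print(result)
```

With both packages installed, the printed document should keep bar as a self-closing element, matching the session above.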