Question

BeautifulSoup4 seems to parse <br>, <img>, and other void elements as containers:

html = """\
<!doctype html>
<head><title>xyz</title></head>
<p>hey</p>
line<br>
<img src='x.jpg' alt='xyz'>
<p>wtf</p>
"""

import bs4
doc = bs4.BeautifulSoup(html)


for x in doc.children:
    print(x)
    print('----')

This prints:

doctype html
----


----
<head><title>xyz</title></head>
----


----
<p>hey</p>
----

line
----
<br>
<img alt="xyz" src="x.jpg">
<p>wtf</p>
</img></br>
----

Is there an option to make BS parse these tags correctly?

Answer 1

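It is the parser that interprets these tags as containers. You are essentially feeding in invalid HTML, and it is up to the parser to make sense of it. The default HTMLParser.HTMLParser() class can only do so much with this.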
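To illustrate why this is a parser-level decision, here is a minimal sketch using only the standard library (Python 3, where the module is html.parser). OutlineBuilder, outline() and the void_aware flag are hypothetical names made up for this illustration; this is not how BeautifulSoup builds its tree, only a toy showing that the tokenizer reports start tags and something else has to decide whether each one opens a container.

```python
from html.parser import HTMLParser

# Void elements as listed in the HTML standard: no content, no closing tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class OutlineBuilder(HTMLParser):
    """Hypothetical toy tree builder: records (depth, tag) per start tag."""

    def __init__(self, void_aware):
        super().__init__()
        self.void_aware = void_aware
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        # The tokenizer only says "a start tag was seen"; treating it as a
        # container (and so nesting what follows) is the builder's choice.
        if not (self.void_aware and tag in VOID_ELEMENTS):
            self.depth += 1

    def handle_endtag(self, tag):
        # Naive: any end tag closes the innermost open element.
        if self.depth > 0:
            self.depth -= 1


def outline(html, void_aware):
    builder = OutlineBuilder(void_aware)
    builder.feed(html)
    builder.close()
    return builder.outline


html = "<p>hey</p>line<br><img src='x.jpg' alt='xyz'><p>wtf</p>"
naive = outline(html, void_aware=False)
aware = outline(html, void_aware=True)
```

Without the void-element list, <img> ends up nested inside <br> and the last <p> inside <img>, which is the shape of the output in the question. With the list, all four tags stay at the same level.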

Switch parsers; you will need to install either lxml or html5lib:

doc = BeautifulSoup(html, 'lxml')

or

doc = BeautifulSoup(html, 'html5lib')

Demo (with lxml, which adds <html> and <body> tags, so I drill down a little first):

>>> doc = BeautifulSoup(html, 'lxml')
>>> for x in doc.body.children:
...     print(x)
...     print('----')
... 
<p>hey</p>
----

line
----
<br/>
----


----
<img src="x.jpg"/>
----


----
<p>wtf</p>
----


----

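html5lib produces exactly the same result here; it is the slowest of the three supported HTML parsers, but it also most accurately reproduces what browsers do with broken HTML.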
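If you cannot be sure which parsers are installed where your code runs, one option is to pick the parser name at runtime. This is only a sketch under my own assumptions: pick_parser and its is_installed parameter are hypothetical helpers, not part of BeautifulSoup, and the default order (lxml first, then html5lib) simply favours speed over browser-like accuracy.

```python
from importlib.util import find_spec


def pick_parser(preferred=("lxml", "html5lib"), is_installed=None):
    """Return the first installed parser name, else the built-in 'html.parser'.

    Hypothetical helper: `is_installed` can be overridden for testing; by
    default it checks whether a module of that name can be imported.
    """
    if is_installed is None:
        is_installed = lambda name: find_spec(name) is not None
    for name in preferred:
        if is_installed(name):
            return name
    # Always available, since it ships with Python.
    return "html.parser"


parser_name = pick_parser()
```

You would then pass the result as the second argument, for example BeautifulSoup(html, pick_parser()). Swap the order to ("html5lib", "lxml") if matching browser behaviour matters more than speed.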

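Note that DocType handling in BeautifulSoup is a little odd; as with Comment, ProcessingInstruction, CData and Declaration elements, the str() version of the element shows only the string content, without the prefix and suffix. Use NavigableString.output_ready() to include these: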

>>> next(doc.children)
u'html'
>>> doc = BeautifulSoup(html, 'html.parser')
>>> next(doc.children)
u'doctype html'
>>> type(next(doc.children))
<class 'bs4.element.Doctype'>
>>> next(doc.children).output_ready()
u'<!DOCTYPE doctype html>\n'

lxml does not include the declaration in the tree, but html5lib does.
