I am trying to parse a large div tag in an HTML document, and I need to get all of the HTML and nested tags inside that div. My code:
innerTree = fromstring(str(response.text))
print("The tags inside the target div are")
print innerTree.cssselect('div.story-body__inner')
But it prints:
[<Element div at 0x66daed0>]
I want it to return all of the HTML tags inside the div. How can I do this with lxml?
Answer 0 (score: 1)
lxml is a great library. There is no need to use BeautifulSoup or anything else. Here is how to get the extra information you are looking for:
# import lxml HTML parser and HTML output function
from __future__ import print_function
from lxml.html import fromstring
from lxml.etree import tostring as htmlstring
# test HTML for demonstration
raw_html = """
<div class="story-body__inner">
<p>Test para with <b>subtags</b></p>
<blockquote>quote here</blockquote>
<img src="...">
</div>
"""
# parse the HTML into a tree structure
innerTree = fromstring(raw_html)
# find the divs you want
# first by finding all divs with the given CSS selector
divs = innerTree.cssselect('div.story-body__inner')
# but that returns a list, so grab the first element
div0 = divs[0]
# print that div, and its full HTML representation
print(div0)
print(htmlstring(div0))
# now to find sub-items
print('\n-- etree nodes')
for e in div0.xpath(".//*"):
    print(e)
print('\n-- HTML tags')
for e in div0.xpath(".//*"):
    print(e.tag)
print('\n-- full HTML text')
for e in div0.xpath(".//*"):
    print(htmlstring(e))
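Note that lxml functions such as cssselect and xpath return a list of nodes, not a single node. You have to index into that list to get the node it contains, even when there is only one.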
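To make the list behaviour concrete, here is a minimal sketch. The sample HTML and the `first_or_none` helper are my own illustrations, not part of lxml, and I use `xpath` only so the snippet does not depend on the separate cssselect package.

```python
from lxml.html import fromstring

# Small sample document, made up for this illustration
raw_html = """
<html><body>
<div class="story-body__inner"><p>one</p><p>two</p></div>
</body></html>
"""
tree = fromstring(raw_html)

# xpath() returns a list even when exactly one node matches
matches = tree.xpath('//div[@class="story-body__inner"]')
# ...and an empty list when nothing matches
missing = tree.xpath('//div[@class="no-such-class"]')


def first_or_none(nodes):
    """Hypothetical helper: first node of a result list, or None if empty."""
    return nodes[0] if nodes else None


div0 = first_or_none(matches)
print(type(matches).__name__, len(matches))
print(div0.tag)
print(first_or_none(missing))
```

Indexing with `[0]` directly raises an IndexError when the selector matches nothing, so a guard like this can be handy when scraping pages whose markup may change.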
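Getting "all the subtags" or "all the sub-HTML" can mean several things: getting the ElementTree nodes, getting the tag names, or getting the full HTML text of those nodes. This code demonstrates all three, and it does so with an XPath query. Sometimes CSS selectors are more convenient, sometimes XPath is. In this case, the XPath query .//* means "return all nodes, with any tag name, at any depth below the current node."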
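As a rough illustration of what the depth part of that query means, the sketch below compares `./*` (direct children only) with `.//*` (all descendants), and with the element method `iterdescendants()`, which walks the same nodes for this input. The sample HTML is a trimmed copy of the test markup and the variable names are mine.

```python
from lxml.html import fromstring

raw_html = """
<div class="story-body__inner">
<p>Test para with <b>subtags</b></p>
<blockquote>quote here</blockquote>
</div>
"""
# Select the div explicitly rather than relying on what fromstring() returns
div0 = fromstring(raw_html).xpath('//div[@class="story-body__inner"]')[0]

# "./*" matches direct children only
children = [e.tag for e in div0.xpath("./*")]
# ".//*" matches elements at any depth, in document order
descendants = [e.tag for e in div0.xpath(".//*")]
# iterdescendants() yields the same elements here without using XPath
via_iter = [e.tag for e in div0.iterdescendants()]

print(children)
print(descendants)
print(via_iter)
```

If you only want the top-level tags inside the div, `./*` (or simply iterating over `div0`) is probably what you are after; `.//*` also returns nested tags such as the `b` inside the `p`.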
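Running the full example script under Python 2 produces the output below. (The same code runs fine under Python 3, but the output text differs slightly, because under Python 3 etree.tostring returns a byte string rather than a Unicode string.)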
<Element div at 0x106eac8e8>
<div class="story-body__inner">
<p>Test para with <b>subtags</b></p>
<blockquote>quote here</blockquote>
<img src="..."/>
</div>
-- etree nodes
<Element p at 0x106eac838>
<Element b at 0x106eac890>
<Element blockquote at 0x106eac940>
<Element img at 0x106eac998>
-- HTML tags
p
b
blockquote
img
-- full HTML text
<p>Test para with <b>subtags</b></p>
<b>subtags</b>
<blockquote>quote here</blockquote>
<img src="..."/>