Question

I am trying to parse a large div tag in an HTML document and need to get all of the HTML and nested tags inside the div. My code:

innerTree = fromstring(str(response.text))
print("The tags inside the target div are")
print(innerTree.cssselect('div.story-body__inner'))

But it prints:

[<Element div at 0x66daed0>]

I want it to return all of the HTML tags inside the div. How can I do this with LXML?

Answer 1

LXML is a great library. There is no need to use BeautifulSoup or anything else. Here is how to get the extra information you are looking for:

# import lxml HTML parser and HTML output function
from __future__ import print_function
from lxml.html import fromstring
from lxml.etree import tostring as htmlstring

# test HTML for demonstration
raw_html = """
    <div class="story-body__inner">
        <p>Test para with <b>subtags</b></p>
        <blockquote>quote here</blockquote>
        <img src="...">
    </div>
"""

# parse the HTML into a tree structure
innerTree = fromstring(raw_html)

# find the divs you want
# first by finding all divs with the given CSS selector
divs = innerTree.cssselect('div.story-body__inner')

# but that returns a list, so grab the first of those
div0 = divs[0]

# print that div, and its full HTML representation
print(div0)
print(htmlstring(div0))

# now to find sub-items

print('\n-- etree nodes')
for e in div0.xpath(".//*"):
    print(e)

print('\n-- HTML tags')
for e in div0.xpath(".//*"):
    print(e.tag)

print('\n-- full HTML text')
for e in div0.xpath(".//*"):
    print(htmlstring(e))

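Note that lxml functions such as cssselect and xpath return lists of nodes, not a single node. You must index these lists to get the nodes they contain, even if there is only one.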
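As a minimal sketch of the point above: the snippet below shows that a query result is a list even for a single match, and that an empty result should be checked before indexing. The test markup, the class name `no-such-class`, and the variable names are illustrative assumptions. It uses `xpath` only; the answer states that `cssselect` returns a list in the same way.

```python
from lxml.html import fromstring

# Hypothetical test page, not taken from the question's real response.
page = fromstring(
    '<html><body><div class="story-body__inner"><p>hi</p></div></body></html>'
)

# The result is a list, even when exactly one node matches.
matches = page.xpath('//div[@class="story-body__inner"]')

# Indexing an empty list raises IndexError, so guard before taking [0].
first = matches[0] if matches else None

# A query that matches nothing returns an empty list, not None.
missing = page.xpath('//div[@class="no-such-class"]')

print(len(matches), first.tag if first is not None else None, missing)
```

Checking the list before indexing keeps a selector that matches nothing from raising an IndexError.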

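Getting all sub-tags or sub-HTML can mean several things: getting the ElementTree nodes, getting the tag names, or getting the full HTML text of those nodes. The listing in this answer demonstrates all three, using an XPath query. Sometimes CSS selectors are more convenient, sometimes XPath. In this case, the XPath query .//* means "return all nodes, of any tag name, at any depth below the current node."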
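To make the "any depth" part concrete, here is a small sketch comparing `.//*` with `./*`, which selects only direct children. The markup is a trimmed copy of the test HTML and the variable names are illustrative. `iterdescendants()` is shown as a non-XPath alternative, not as part of the original answer.

```python
from lxml.html import fromstring

# A single-element fragment, so fromstring returns the div itself.
div = fromstring(
    '<div class="story-body__inner">'
    '<p>Test para with <b>subtags</b></p>'
    '<blockquote>quote here</blockquote>'
    '</div>'
)

# .//* walks every level below the div, so the nested <b> is included.
descendants = [e.tag for e in div.xpath('.//*')]

# ./* stops at direct children, so the nested <b> is left out.
children = [e.tag for e in div.xpath('./*')]

# iterdescendants() yields the same elements in document order here.
iterated = [e.tag for e in div.iterdescendants()]

print(descendants)
print(children)
```

Use `./*` when only the top-level structure of the div matters, and `.//*` when every nested tag is needed.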

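Running this under Python 2 gives the output below. (The same code runs fine under Python 3, but the output text differs slightly, because etree.tostring returns byte strings rather than Unicode strings under Python 3.)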

<Element div at 0x106eac8e8>
<div class="story-body__inner">
        <p>Test para with <b>subtags</b></p>
        <blockquote>quote here</blockquote>
        <img src="..."/>
    </div>


-- etree nodes
<Element p at 0x106eac838>
<Element b at 0x106eac890>
<Element blockquote at 0x106eac940>
<Element img at 0x106eac998>

-- HTML tags
p
b
blockquote
img

-- full HTML text
<p>Test para with <b>subtags</b></p>
<b>subtags</b>
<blockquote>quote here</blockquote>  
<img src="..."/>
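Regarding the Python 3 note above, one possible way to get text rather than bytes is to pass `encoding='unicode'` to `tostring`. The sketch below uses `lxml.html.tostring` (the listing imports `lxml.etree.tostring` instead), and the inner-HTML join is an assumption about what "all HTML inside the div" should mean, not the answer's own method.

```python
from lxml.html import fromstring, tostring

div = fromstring(
    '<div class="story-body__inner">'
    '<p>Test para with <b>subtags</b></p>'
    '<blockquote>quote here</blockquote>'
    '</div>'
)

# Default serialization returns bytes under Python 3.
raw = tostring(div)

# encoding='unicode' asks for a text string instead.
outer = tostring(div, encoding='unicode')

# Inner HTML: the div's leading text plus each child serialized in turn.
# tostring() includes each child's tail text by default.
inner = (div.text or '') + ''.join(
    tostring(child, encoding='unicode') for child in div
)

print(outer)
print(inner)
```

Here `outer` includes the wrapping div tag, while `inner` holds only what sits between its opening and closing tags.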
