XPath: extracting the current node's content, including all child nodes

Time: 2015-04-27 05:05:41

Tags: python xpath lxml

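I'm having trouble extracting the content of the current node, including all of its child nodes.

As in the code below, I want to get the string abcdefg<b>b1b2b3</b> from inside the pre tag.

I can't get it with "child::*", which returns only the b element. If I use "/text()", I lose the b tag formatting information. Please help.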

from lxml import html
import lxml.etree as le

input = "<pre>abcdefg<b>b1b2b3</b></pre>"
input_xpath = "//pre/child::*"
tree = html.fromstring(input)
result = tree.xpath(input_xpath)
# encoding="unicode" makes tostring() return str rather than bytes (Python 3)
result1 = [le.tostring(item, encoding="unicode") for item in result]
result2 = ''.join(result1)
print(result2)

output: <b>b1b2b3</b>

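2 answers:

Answer 0 (score: 2)

To get the markup content of an XML node (sometimes called "innerXML"), first select the node itself, rather than its child nodes or its text content: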

from lxml import html
import lxml.etree as le

input = "<pre>abcdefg<b>b1b2b3</b></pre>"
tree = html.fromstring(input)
node = tree.xpath("//pre")[0]

Then combine the node's text content with the markup of all its child nodes:

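# encoding="unicode" makes tostring() return str rather than bytes (Python 3)
result = (node.text or '') + ''.join(le.tostring(e, encoding='unicode') for e in node)
print(result)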

Output:

abcdefg<b>b1b2b3</b>

Answer 1 (score: 0)

Try replacing your XPath with the following:

In [0]: input = "<pre>abcdefg<b>b1b2b3</b></pre>"

In [1]: input_xpath = "//pre//text()"

In [2]: tree = html.fromstring(input)

In [3]: result = tree.xpath(input_xpath)

In [4]: result
Out[4]: ['abcdefg', 'b1b2b3']