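XPath to get all content under a tag as a single string

Time: 2014-04-30 13:17:37

Tags: xpath

I want to write an XPath in Python that gets the full content of the li tag, including the content of the nested a tag.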

<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>

I wrote the XPath as

//div[@class='inner_body_left']/ul/li//text()

This outputs 3 separate strings:

Lake 2014: 9th Biennial Lake Symposium on "
Conservation of Wetland Ecosystems in Western Ghats
", 13-15th November 2014.

How can I get them as a single string?

3 answers:

Answer 0: (score: 1)

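The best option seems to be simply using string(). It converts the whole element to an xs:string, and comments in the XML are left out of the result. Note that calling a function as a path step, as written here, is XPath 2.0 syntax; in XPath 1.0 the path is passed as the argument instead, as in string(...):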

//div[@class='inner_body_left']/ul/li/string()

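If that does not work for some business-logic reason, you can always join the strings yourself. concat() needs at least two arguments and does not accept a sequence of nodes, so in XPath 2.0 the function for this is string-join():

string-join(//div[@class='inner_body_left']/ul/li//text(), '')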
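
Since the question asks about Python, the same joining can also be done outside XPath. The following is a minimal sketch, assuming the markup is well-formed XML and using only the standard library's xml.etree.ElementTree; the names SNIPPET, li_texts and collapse are made up for illustration and do not come from the question.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in the div/ul that the XPath expects.
SNIPPET = """<div class="inner_body_left">
<ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
</ul>
</div>"""


def li_texts(markup):
    """Return one string per <li>, with the text of nested tags included."""
    root = ET.fromstring(markup)
    # root is the <div> itself, so the path is relative to it.
    items = root.findall("./ul/li")
    # itertext() yields every text node under the element in document order.
    return ["".join(li.itertext()) for li in items]


def collapse(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


texts = [collapse(t) for t in li_texts(SNIPPET)]
print(texts[0])
```

Because the line breaks around the a tag become spaces, the result keeps a space on each side of the link text, inside the quotation marks.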

Answer 1: (score: 1)

Example Python shell session:

>>> import lxml.html
>>> doc = lxml.html.fromstring("""<div class="inner_body_left">
... <ul>
... <li>
... Lake 2014: 9th Biennial Lake Symposium on "
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
... ", 13-15th November 2014
... </li>
... </ul>
... </div>""")

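The simplest option is string(), provided you know that your XPath expression matches only 1 node; otherwise string() converts only the first node of the matched node-set: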

>>> doc.xpath("string(//div[@class='inner_body_left']/ul/li)") 
'\nLake 2014: 9th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2014\n'
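
The result above still contains the line breaks from the markup. If a single line is wanted, one possibility is the XPath 1.0 function normalize-space(), which trims the string and collapses whitespace runs. This is a sketch under the same single-match assumption as string(); the variable name one_line is only illustrative.

```python
import lxml.html

doc = lxml.html.fromstring("""<div class="inner_body_left">
<ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
</ul>
</div>""")

# normalize-space() takes the string value of the first matched node,
# strips leading/trailing whitespace and collapses inner runs to one space.
one_line = doc.xpath("normalize-space(//div[@class='inner_body_left']/ul/li)")
print(one_line)
```

Like string(), normalize-space() only looks at the first node when the expression matches several.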

Getting all text nodes:

>>> doc.xpath("//div[@class='inner_body_left']/ul/li//text()") 
['\nLake 2014: 9th Biennial Lake Symposium on "\n', 'Conservation of Wetland Ecosystems in Western Ghats', '\n", 13-15th November 2014\n']
>>> doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*/text()") 
['\nLake 2014: 9th Biennial Lake Symposium on "\n', 'Conservation of Wetland Ecosystems in Western Ghats', '\n", 13-15th November 2014\n']

To exclude the text inside the a element, use /descendant-or-self::*[not(self::a)]/ instead of //:

>>> doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*[not(self::a)]/text()") 
['\nLake 2014: 9th Biennial Lake Symposium on "\n', '\n", 13-15th November 2014\n']
>>> "".join(doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*[not(self::a)]/text()") )
'\nLake 2014: 9th Biennial Lake Symposium on "\n\n", 13-15th November 2014\n'
>>> 

Updated example with multiple elements:

>>> doc = """<div class="inner_body_left">
... <ul>
... <li>
... Lake 2014: 9th Biennial Lake Symposium on "
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
... ", 13-15th November 2014
... </li>
... <li>
... Lake 2015: 10th Biennial Lake Symposium on "
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
... ", 13-15th November 2015
... </li>
... </ul>
... </div>"""
>>> root = lxml.html.fromstring(doc)
>>>
>>> import pprint
>>> pprint.pprint([element.xpath("string(.)")
...                for element in root.xpath("//div[@class='inner_body_left']/ul/li")])
['\nLake 2014: 9th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2014\n',
 '\nLake 2015: 10th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2015\n']
>>> pprint.pprint(["".join(element.xpath("./descendant-or-self::*[not(self::a)]/text()"))
...                for element in root.xpath("//div[@class='inner_body_left']/ul/li")]
... )
['\nLake 2014: 9th Biennial Lake Symposium on "\n\n", 13-15th November 2014\n',
 '\nLake 2015: 10th Biennial Lake Symposium on "\n\n", 13-15th November 2015\n']
>>> 
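
The per-element loop above can be wrapped in a small helper that also collapses the line breaks. This is only a sketch: the function name li_lines and its skip_links parameter are hypothetical, and the whitespace handling is done in Python with str.split() rather than in XPath.

```python
import lxml.html

MARKUP = """<div class="inner_body_left">
<ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
<li>
Lake 2015: 10th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2015
</li>
</ul>
</div>"""


def li_lines(markup, skip_links=False):
    """Return one whitespace-collapsed string per matched <li>."""
    root = lxml.html.fromstring(markup)
    items = root.xpath("//div[@class='inner_body_left']/ul/li")
    if skip_links:
        # Text nodes whose parent element is not an <a>.
        path = "./descendant-or-self::*[not(self::a)]/text()"
        raw = ["".join(li.xpath(path)) for li in items]
    else:
        # string(.) is evaluated relative to each <li> in turn.
        raw = [li.xpath("string(.)") for li in items]
    # split() with no argument splits on any whitespace, newlines included.
    return [" ".join(text.split()) for text in raw]


full = li_lines(MARKUP)
without_links = li_lines(MARKUP, skip_links=True)
print(full)
print(without_links)
```

Evaluating the expression per element avoids the first-node-only behaviour of string() on a multi-node match.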

Answer 2: (score: 0)

See my solution.

I used

concat(substring(//div/ul/li/text()[1],1,string-length(//div/ul/li/text()[1])-1),//div/ul/li/a/text(),substring(//div/ul/li/text()[2],2))

on

<?xml version="1.0" encoding="UTF-8"?><div>
  <ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
  </ul>
</div>

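To get a single line, the line breaks around the a tag have to be removed first. The two substring() calls do that: the first drops the last character of the text before the a tag, and the second drops the first character of the text after it.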
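
The same character trimming can be mirrored in Python to see what it produces. This is a rough sketch using the standard library's xml.etree.ElementTree, not the answerer's code; the XML declaration is left out of the sample and the variable names are illustrative. It assumes, like the XPath above, that exactly one line break sits on each side of the a tag.

```python
import xml.etree.ElementTree as ET

XML = """<div>
  <ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
  </ul>
</div>"""

root = ET.fromstring(XML)
li = root.find("./ul/li")
link = li.find("a")

# li.text is the text before <a>; link.tail is the text after </a>.
before = li.text[:-1]   # like substring(text()[1], 1, string-length(...) - 1)
after = link.tail[1:]   # like substring(text()[2], 2)
joined = before + link.text + after

# The line breaks at the very start and end of the <li> are still present,
# so strip them to end up with a single line.
single_line = joined.strip()
print(single_line)
```

Unlike collapsing all whitespace to spaces, this keeps the link text directly inside the quotation marks, but it depends on the exact layout of the markup.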