I want to write an XPath expression in Python that returns the full content of the li tag, including the content of the a tag inside it.
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
I wrote the XPath as
//div[@class='inner_body_left']/ul/li//text()
This returns 3 separate strings:
Lake 2014: 9th Biennial Lake Symposium on "
Conservation of Wetland Ecosystems in Western Ghats
", 13-15th November 2014.
How can I get them as a single string?
Answer 0 (score: 1)
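The best option seems to be simply using string() to achieve what you want. It converts the whole element to xs:string, which also leaves out any comments in the XML:
//div[@class='inner_body_left']/ul/li/string()
This step form requires XPath 2.0. An XPath 1.0 engine, such as the one lxml uses, only accepts string() as a function wrapped around the path: string(//div[@class='inner_body_left']/ul/li).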
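If that does not work for some business-logic reason, you can always join the text nodes yourself. concat() needs at least two arguments and does not accept a single node-set, so in XPath 2.0 use string-join() instead:
string-join(//div[@class='inner_body_left']/ul/li//text(), '')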
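For an XPath 1.0 engine, here is a minimal sketch of both ideas using lxml. It assumes the li from the question sits inside a div with class inner_body_left, as the question's XPath implies; the names LI_PATH, as_string and joined are only illustrative.

```python
import lxml.html

# Markup from the question, wrapped in the div that the XPath expects (an assumption).
doc = lxml.html.fromstring("""<div class="inner_body_left">
<ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
</ul>
</div>""")

LI_PATH = "//div[@class='inner_body_left']/ul/li"

# XPath 1.0 form: string() wraps the path and returns the string value
# of the first node the path matches.
as_string = doc.xpath("string(%s)" % LI_PATH)

# Fallback: select every descendant text node and join them in Python.
joined = "".join(doc.xpath(LI_PATH + "//text()"))

print(as_string == joined)
```

With a single matching li, both approaches should produce the same string, newlines included.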
Answer 1 (score: 1)
Sample Python shell session:
>>> import lxml.html
>>> doc = lxml.html.fromstring("""<div class="inner_body_left">
... <ul>
... <li>
... Lake 2014: 9th Biennial Lake Symposium on "
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
... ", 13-15th November 2014
... </li>
... </ul>
... </div>""")
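The simplest approach is string(), provided you know your XPath expression matches only 1 node; otherwise string() converts only the first node of the matched node-set: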
>>> doc.xpath("string(//div[@class='inner_body_left']/ul/li)")
'\nLake 2014: 9th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2014\n'
Getting all text nodes:
>>> doc.xpath("//div[@class='inner_body_left']/ul/li//text()")
['\nLake 2014: 9th Biennial Lake Symposium on "\n', 'Conservation of Wetland Ecosystems in Western Ghats', '\n", 13-15th November 2014\n']
>>> doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*/text()")
['\nLake 2014: 9th Biennial Lake Symposium on "\n', 'Conservation of Wetland Ecosystems in Western Ghats', '\n", 13-15th November 2014\n']
Excluding the text of the a element (use /descendant-or-self::*[not(self::a)]/ instead of //):
>>> doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*[not(self::a)]/text()")
['\nLake 2014: 9th Biennial Lake Symposium on "\n', '\n", 13-15th November 2014\n']
>>> "".join(doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*[not(self::a)]/text()") )
'\nLake 2014: 9th Biennial Lake Symposium on "\n\n", 13-15th November 2014\n'
>>>
Updated example with multiple elements:
>>> doc = """<div class="inner_body_left">
... <ul>
... <li>
... Lake 2014: 9th Biennial Lake Symposium on "
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
... ", 13-15th November 2014
... </li>
... <li>
... Lake 2015: 10th Biennial Lake Symposium on "
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
... ", 13-15th November 2015
... </li>
... </ul>
... </div>"""
>>> root = lxml.html.fromstring(doc)
>>>
>>> import pprint
>>> pprint.pprint([element.xpath("string(.)")
... for element in root.xpath("//div[@class='inner_body_left']/ul/li")])
['\nLake 2014: 9th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2014\n',
'\nLake 2015: 10th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2015\n']
>>> pprint.pprint(["".join(element.xpath("./descendant-or-self::*[not(self::a)]/text()"))
... for element in root.xpath("//div[@class='inner_body_left']/ul/li")]
... )
['\nLake 2014: 9th Biennial Lake Symposium on "\n\n", 13-15th November 2014\n',
'\nLake 2015: 10th Biennial Lake Symposium on "\n\n", 13-15th November 2015\n']
>>>
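The strings above still carry the newlines from the markup. If one clean line per li is wanted, a possible variation (not part of the original answer) is to evaluate XPath's normalize-space() on each li, which trims the string value and collapses every whitespace run into a single space. The variable name titles is only illustrative.

```python
import lxml.html

root = lxml.html.fromstring("""<div class="inner_body_left">
<ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
<li>
Lake 2015: 10th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2015
</li>
</ul>
</div>""")

# normalize-space(.) takes the string value of each li, strips leading and
# trailing whitespace, and replaces each inner whitespace run with one space.
titles = [li.xpath("normalize-space(.)")
          for li in root.xpath("//div[@class='inner_body_left']/ul/li")]

for title in titles:
    print(title)
```

Because each newline becomes a space, a space remains between the quotation marks and the link text.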
Answer 2 (score: 0)
See my solution.
I used
concat(substring(//div/ul/li/text()[1],1,string-length(//div/ul/li/text()[1])-1),//div/ul/li/a/text(),substring(//div/ul/li/text()[2],2))
on this document:
<?xml version="1.0" encoding="UTF-8"?><div>
<ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
</ul>
</div>
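To get a single line, the line breaks around the link have to be removed: the substring() calls cut the trailing newline off the first text node and the leading newline off the second.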
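As far as I can tell, that expression still leaves the newline at the very start and the very end of the li. A minimal sketch with lxml, assuming the XML document above, runs the expression as written and then wraps it in normalize-space() to trim those as well. The names TRIMMED, raw and clean are only illustrative.

```python
from lxml import etree

# Bytes rather than str: lxml rejects str input that carries an encoding declaration.
xml = b"""<?xml version="1.0" encoding="UTF-8"?><div>
<ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
</ul>
</div>"""
root = etree.fromstring(xml)

# The expression from the answer, split across lines only for readability.
TRIMMED = (
    "concat("
    "substring(//div/ul/li/text()[1],1,string-length(//div/ul/li/text()[1])-1),"
    "//div/ul/li/a/text(),"
    "substring(//div/ul/li/text()[2],2))"
)

raw = root.xpath(TRIMMED)                             # keeps the outer newlines
clean = root.xpath("normalize-space(%s)" % TRIMMED)   # trims them as well

print(repr(raw))
print(clean)
```

Note that the positional predicates tie this expression to an li with exactly one a between two text nodes; string() or a join over text nodes is less brittle when the markup varies.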