Question

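Since I have now run into this annoying issue for the second time, I thought asking would help.

Sometimes I have to get elements from XML documents, but the ways to do this are awkward.

I'd like to know of a Python library that does what I want, an elegant way to formulate my XPaths, a way to register the namespace prefixes automatically, or a hidden preference in the built-in XML implementations or in lxml to strip namespaces completely. Clarification follows, unless you already know what I want :)

Example doc: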

<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>

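What I can do

The ElementTree API is the only built-in one (that I know of) providing XPath queries. But it requires me to use "UNames", which look like this: /{http://really-long-namespace.uri}root/{http://with-ambivalent.end/#}elem

As you can see, these are quite verbose. I can shorten them by doing the following: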

default_ns = "http://really-long-namespace.uri"
other_ns   = "http://with-ambivalent.end/#"
doc.find("/{{{0}}}root/{{{1}}}elem".format(default_ns, other_ns))

But that is both {{{ugly}}} and fragile, because http…end# ≃ http…end/ ≃ http…end, and who am I to know which variant will be used?
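One way to keep such "UNames" readable is to build them in a small helper. The following is a minimal sketch, assuming Python 3 and the standard-library xml.etree.ElementTree; the helper name clark and the constant names are hypothetical and not part of any library. Paths are written relative to the root element here rather than starting with a slash.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>"""

DEFAULT_NS = "http://really-long-namespace.uri"
OTHER_NS = "http://with-ambivalent.end/#"


def clark(namespace, local_name):
    """Build a '{uri}local' name (hypothetical helper, for illustration only)."""
    return "{%s}%s" % (namespace, local_name)


root = ET.fromstring(DOC)

# The parsed root element carries its namespace inside its tag.
print(root.tag == clark(DEFAULT_NS, "root"))

# find() looks among the children of the element it is called on.
elem = root.find(clark(OTHER_NS, "elem"))
print(elem.tag)
```

The fragility remains: the namespace URI still has to match the document character for character.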

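Also, lxml supports namespace prefixes, but it neither uses the ones in the document nor provides an automated way to deal with default namespaces. I would still have to get one element of each namespace to retrieve it from the document. Namespace attributes are not preserved, so there is no way of retrieving them automatically from those either.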
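To illustrate what manually registered prefixes look like, here is a hedged sketch using the standard-library ElementTree, whose find() and findall() accept a prefix-to-URI mapping. The prefixes d and o are arbitrary choices made for this example; they do not come from the document.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>"""

# Prefixes picked by the caller; they need not match the document's own.
NAMESPACES = {
    "d": "http://really-long-namespace.uri",
    "o": "http://with-ambivalent.end/#",
}

root = ET.fromstring(DOC)

# The mapping is passed as the second argument (namespaces).
elem = root.find("o:elem", NAMESPACES)
print(elem.tag)

# The same mapping works for descendant searches.
matches = root.findall(".//o:elem", NAMESPACES)
print(len(matches))
```

The mapping still has to be written by hand, which is exactly the inconvenience described above.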

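There is also a namespace-agnostic way of writing XPath queries, but it is both verbose and ugly, and unavailable in the built-in implementation: /*[local-name() = 'root']/*[local-name() = 'elem']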
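As a side note, newer versions of the standard library offer a different namespace-agnostic spelling. The sketch below assumes Python 3.8 or later, where ElementTree paths accept the {*} wildcard; local-name() itself is still not supported there. The helper local_name is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>"""

root = ET.fromstring(DOC)

# "{*}elem" matches children named "elem" in any namespace, or in none.
matches = root.findall("{*}elem")
print([m.tag for m in matches])


def local_name(tag):
    """Strip a leading '{uri}' from an ElementTree tag (illustrative helper)."""
    return tag.rsplit("}", 1)[-1]


print(local_name(root.tag))
```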

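What I want to do

I want to find a library, option, or generic XPath-morphing function that achieves the examples above by typing little more than the following...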

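Unnamespaced: /root/elem
Namespace prefixes from the document: /root/other:elem

...plus maybe some statement that I do indeed want to use the document's prefixes or strip the namespaces.

Further clarification: although my current use case is as simple as that, I will have to handle more complex ones in the future.

Thanks for reading!

Solution

The user samplebias directed my attention to py-dom-xpath; exactly what I was looking for. My actual code now looks like this:

import xml.dom.minidom
import xpath  # py-dom-xpath

# parse the document into a DOM tree
rdf_tree = xml.dom.minidom.parse("install.rdf")
# read the default namespace and prefix from the root node
context = xpath.XPathContext(rdf_tree)

name = context.findvalue("//em:id", rdf_tree)
version = context.findvalue("//em:version", rdf_tree)

# <Description/> inherits the default RDF namespace
resource_nodes = context.find("//Description/following-sibling::*", rdf_tree)

Consistent with the document, simple, namespace-aware; perfect.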
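For readers curious what "reading the prefixes from the root node" can look like, here is a minimal standard-library sketch. It is not taken from py-dom-xpath and makes no claim about how that library works internally; the function name root_namespaces is hypothetical.

```python
import xml.dom.minidom

DOC = """<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>"""


def root_namespaces(document):
    """Map each prefix declared on the root element to its namespace URI.

    The default namespace, if declared, is stored under the empty string.
    """
    namespaces = {}
    attributes = document.documentElement.attributes
    for index in range(attributes.length):
        attribute = attributes.item(index)
        if attribute.name == "xmlns":
            namespaces[""] = attribute.value
        elif attribute.name.startswith("xmlns:"):
            namespaces[attribute.name[len("xmlns:"):]] = attribute.value
    return namespaces


document = xml.dom.minidom.parseString(DOC)
print(root_namespaces(document))
```

This only inspects the root element, so declarations made deeper in the tree are not seen.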

Answer 1

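The *[local-name() = "elem"] syntax should work, but to make it easier you can create a function that simplifies the construction of partial or full "wildcard namespace" XPath expressions.

I'm using python-lxml 2.2.4 on Ubuntu 10.04, and the script below works for me. You'll need to customize the behavior depending on how you want to specify the default namespace for each element, and handle any other XPath syntax you want to fold into the expression: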

import lxml.etree

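def xpath_ns(tree, expr):
    """Parse a simple expression and prepend namespace wildcards where unspecified."""
    # Leave empty steps (from a leading '/' or '//') and prefixed names alone;
    # match every other step by local name, whatever its namespace.
    qual = lambda n: n if not n or ':' in n else '*[local-name() = "%s"]' % n
    expr = '/'.join(qual(n) for n in expr.split('/'))
    # Drop the default-namespace entry (keyed by None); XPath cannot use it as a prefix.
    nsmap = dict((k, v) for k, v in tree.nsmap.items() if k)
    return tree.xpath(expr, namespaces=nsmap)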

doc = '''<root xmlns="http://really-long-namespace.uri"
    xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>'''

tree = lxml.etree.fromstring(doc)
print(xpath_ns(tree, '/root'))
print(xpath_ns(tree, '/root/elem'))
print(xpath_ns(tree, '/root/other:elem'))

Output:

[<Element {http://really-long-namespace.uri}root at 23099f0>]
[<Element {http://with-ambivalent.end/#}elem at 2309a48>]
[<Element {http://with-ambivalent.end/#}elem at 2309a48>]
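As one possible direction for the "other XPath syntax" mentioned above, the following sketch separates the string rewriting from the query and only rewrites steps that are plain, unprefixed names. The function name wildcard_ns and the regular expression are assumptions made for this example; splitting on slashes remains naive and will misbehave if a predicate or string literal contains a slash.

```python
import re

# A plain, unprefixed element name; deliberately narrow (no ':' or brackets).
_PLAIN_NAME = re.compile(r"^[A-Za-z_][\w.-]*$")


def wildcard_ns(expr):
    """Rewrite unprefixed name steps so they match by local name only."""
    steps = []
    for step in expr.split("/"):
        if _PLAIN_NAME.match(step):
            steps.append('*[local-name() = "%s"]' % step)
        else:
            # Empty steps, '.', '..', '*', '@attr', 'text()', prefixed names
            # and steps with predicates are passed through unchanged.
            steps.append(step)
    return "/".join(steps)


print(wildcard_ns("/root/elem"))
print(wildcard_ns("/root/other:elem"))
print(wildcard_ns("//elem/text()"))
```

The resulting string can then be handed to an XPath engine that supports local-name(), such as the tree.xpath call used in the script above.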

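Update: if you find you need to parse XPath, you can look at projects like py-dom-xpath, which is a pure-Python implementation of (most of) XPath 1.0. At the very least, that will give you some idea of the complexity of parsing XPath.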

Answer 2

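First off, about "what you want to do":

Unnamespaced: /root/elem -> no problem here, I presume
Namespace prefixes from the document: /root/other:elem -> well, that's a bit of a problem; you cannot just use "namespace prefixes from the document". Even within one document:
- namespaced elements do not necessarily even have a prefix
- the same prefix is not necessarily always mapped to the same namespace URI
- the same namespace URI does not necessarily always have the same prefix

FYI: if you want to get at the prefix mappings that are in scope for a certain element, try elem.nsmap in lxml. Also, the iterparse and iterwalk methods in lxml.etree can be used to be "notified" of namespace declarations.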
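The notification idea can be sketched with the standard library as well, since xml.etree.ElementTree.iterparse also reports "start-ns" events. In the example below, the second elem with a re-bound prefix and its URI http://example.invalid/rebound are additions made purely for illustration, to show one prefix mapping to two URIs within one document.

```python
import io
import xml.etree.ElementTree as ET

# The last element re-binds the "other" prefix (hypothetical URI, for illustration).
DOC = b"""<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
    <other:elem xmlns:other="http://example.invalid/rebound"/>
</root>"""

uris_by_prefix = {}
# Each "start-ns" event carries a (prefix, uri) pair; the default
# namespace is reported with an empty prefix.
for event, (prefix, uri) in ET.iterparse(io.BytesIO(DOC), events=("start-ns",)):
    uris_by_prefix.setdefault(prefix, []).append(uri)

for prefix, uris in uris_by_prefix.items():
    print(repr(prefix), uris)
```

Because a prefix can end up with more than one URI, a single document-wide prefix map is only safe when the document declares each prefix once.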

How to find XML elements via XPath in Python in a namespace-agnostic way?
