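Two lines prevent reaching a specific XML node with Python

Asked: 2015-11-06 16:39:16

Tags: python xml python-2.7 lxml

To reach a specific node in Python, I do something like nodeZ = xmlDoc.find("X/Y/Z"), and this works well for me.

However, when the XML file starts with the following two lines, I can no longer select or reach the node. What should I do?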

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>

BTW, I am loading the lxml package.

Update: a real example

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
<raml version="2.0" xmlns="raml20.xsd">
  <cmData type="actual">
    <header>
      <log dateTime="2015-10-13T15:57:06" action="created" appInfo="ActualExporter">InternalValues are used</log>
    </header>
    <managedObject class="MRBTS" version="XXX" distName="PLMN-PLMN/MRBTS-XXX" id="111">
    </managedObject>
  </cmData>
</raml>

I try to reach the managedObject node by doing the following:

from lxml import etree
xmlDoc = etree.parse("D:/File.xml")
moNode = xmlDoc.find("cmData/managedObject")

As I mentioned above, this only works if I delete the first two lines.

2 answers:

Answer 0 (score: 0)

Try:

from lxml import etree, html

text = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
<raml version="2.0" xmlns="raml20.xsd">
  <cmData type="actual">
    <header>
      <log dateTime="2015-10-13T15:57:06" action="created" appInfo="ActualExporter">InternalValues are used</log>
    </header>
    <managedObject class="MRBTS" version="XXX" distName="PLMN-PLMN/MRBTS-XXX" id="111">
    </managedObject>
  </cmData>
</raml>"""

html_code = etree.HTML(text)
result = etree.tostring(html_code, pretty_print=True, method="html")
tree = html.fromstring(result)

data = tree.xpath('//raml/cmdata/managedobject')[0]

managed_object_class = data.xpath('@class')[0]
managed_object_version = data.xpath('@version')[0]
managed_object_distname = data.xpath('@distname')[0]
managed_object_id = data.xpath('@id')[0]

print "Id: {}".format(managed_object_id)
print "Class: {}".format(managed_object_class)
print "Version: {}".format(managed_object_version)
print "DistName: {}".format(managed_object_distname)

Output:

Id: 111
Class: MRBTS
Version: XXX
DistName: PLMN-PLMN/MRBTS-XXX
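
The HTML round trip appears to work because the HTML parser lowercases element and attribute names and does not apply XML namespace rules, which is why the XPath above uses cmdata, managedobject and @distname. For comparison, here is a minimal sketch that keeps the XML parser and the original attribute casing. It assumes the namespace name raml20.xsd from the question. The fallback import to the standard library is an addition for illustration only, not part of the original answer.

```python
from __future__ import print_function

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption for illustration: fall back to the standard library,
    # whose find() accepts the same {namespace}tag syntax.
    import xml.etree.ElementTree as etree

XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
<raml version="2.0" xmlns="raml20.xsd">
  <cmData type="actual">
    <managedObject class="MRBTS" version="XXX" distName="PLMN-PLMN/MRBTS-XXX" id="111">
    </managedObject>
  </cmData>
</raml>"""

root = etree.fromstring(XML)

# Element names are namespace-qualified; the attributes themselves are not.
mo = root.find("{raml20.xsd}cmData/{raml20.xsd}managedObject")

names = ("id", "class", "version", "distName")
attrs = {name: mo.get(name) for name in names}

for name in names:
    print("{}: {}".format(name, attrs[name]))
```

With this approach the attribute keeps its original spelling (distName), so no lowercased variants are needed.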

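Answer 1 (score: 0)

I cannot reproduce the problem. The presence or absence of the first two lines (the XML declaration and the document type declaration) makes no difference: the element is not found in either case.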
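
One way to check this claim is to run the same lookup on the document with and without the two leading lines. The sketch below is an illustration under a few assumptions: the XML is inlined instead of read from a file, the helper name lookup is hypothetical, and the import falls back to the standard library if lxml is not installed.

```python
from __future__ import print_function

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption for illustration: the standard library behaves the same here.
    import xml.etree.ElementTree as etree

PROLOG = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
"""

BODY = b"""<raml version="2.0" xmlns="raml20.xsd">
  <cmData type="actual">
    <managedObject class="MRBTS" version="XXX" distName="PLMN-PLMN/MRBTS-XXX" id="111">
    </managedObject>
  </cmData>
</raml>"""


def lookup(xml_bytes):
    """Return (unqualified result, namespace-qualified result) for one document."""
    root = etree.fromstring(xml_bytes)
    plain = root.find("cmData/managedObject")
    qualified = root.find("{raml20.xsd}cmData/{raml20.xsd}managedObject")
    return plain, qualified


for label, doc in (("with first two lines", PROLOG + BODY),
                   ("without first two lines", BODY)):
    plain, qualified = lookup(doc)
    print(label,
          "| unqualified found:", plain is not None,
          "| qualified found:", qualified is not None)
```

In both variants the unqualified path returns None and the namespace-qualified path returns the element, which suggests that the xmlns attribute on the root element, not the first two lines, decides whether the lookup succeeds.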

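What matters is that the XML document is in a namespace. The namespace name (raml20.xsd) is a bit unusual, but that is fine. The following prints the wanted element: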

from lxml import etree

xmlDoc = etree.parse("File.xml")
moNode = xmlDoc.find("r:cmData/r:managedObject", namespaces={"r": "raml20.xsd"})
print moNode

The code above uses a prefix (r). Another way is to use the namespace name directly, enclosed in curly braces:

moNode = xmlDoc.find("{raml20.xsd}cmData/{raml20.xsd}managedObject")

A wildcard also works:

moNode = xmlDoc.find("{*}cmData/{*}managedObject")

In all three cases, the output is:

<Element {raml20.xsd}managedObject at 0x2787c60>
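
The three spellings can be compared side by side on an inlined copy of the document. This is a sketch, not part of the original answer: the variable names are hypothetical, and the fallback import is an assumption for environments without lxml (the standard library only accepts the {*} wildcard from Python 3.8 on).

```python
from __future__ import print_function

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption for illustration: standard-library fallback (Python 3.8+ for "{*}").
    import xml.etree.ElementTree as etree

XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
<raml version="2.0" xmlns="raml20.xsd">
  <cmData type="actual">
    <managedObject class="MRBTS" version="XXX" distName="PLMN-PLMN/MRBTS-XXX" id="111">
    </managedObject>
  </cmData>
</raml>"""

root = etree.fromstring(XML)

# 1. Prefix mapped to the namespace name through the namespaces argument.
by_prefix = root.find("r:cmData/r:managedObject", namespaces={"r": "raml20.xsd"})

# 2. Namespace name written out in curly braces.
by_braces = root.find("{raml20.xsd}cmData/{raml20.xsd}managedObject")

# 3. Wildcard that matches any namespace.
by_wildcard = root.find("{*}cmData/{*}managedObject")

for label, node in (("prefix", by_prefix),
                    ("braces", by_braces),
                    ("wildcard", by_wildcard)):
    print(label, node.tag, node.get("distName"))
```

The prefix form tends to be the easiest to read in longer paths, while the wildcard is convenient when the namespace name is not known in advance.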

More information: http://lxml.de/tutorial.html#namespaces