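To access a specific node in Python, I do something like this:
nodeZ = xmlDoc.find("X/Y/Z")
This works fine for me.
However, when the XML file starts with the following two lines, I can no longer select or access the nodes. What should I do?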
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
BTW, I am loading the lxml package.
Update: a real example
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
<raml version="2.0" xmlns="raml20.xsd">
<cmData type="actual">
<header>
<log dateTime="2015-10-13T15:57:06" action="created" appInfo="ActualExporter">InternalValues are used</log>
</header>
<managedObject class="MRBTS" version="XXX" distName="PLMN-PLMN/MRBTS-XXX" id="111">
</managedObject>
</cmData>
</raml>
I try to access the managedObject node by doing the following:
from lxml import etree
xmlDoc = etree.parse("D:/File.xml")
moNode = xmlDoc.find("cmData/managedObject")
As I mentioned above, this only works if I remove the first two lines.
Answer 0 (score: 0)
Try:
from lxml import etree, html
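# Bytes literal: on Python 3, some lxml versions raise ValueError for a str that carries an encoding declaration.
text = b"""<?xml version="1.0" encoding="UTF-8"?>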
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
<raml version="2.0" xmlns="raml20.xsd">
<cmData type="actual">
<header>
<log dateTime="2015-10-13T15:57:06" action="created" appInfo="ActualExporter">InternalValues are used</log>
</header>
<managedObject class="MRBTS" version="XXX" distName="PLMN-PLMN/MRBTS-XXX" id="111">
</managedObject>
</cmData>
</raml>"""
html_code = etree.HTML(text)
result = etree.tostring(html_code, pretty_print=True, method="html")
tree = html.fromstring(result)
data = tree.xpath('//raml/cmdata/managedobject')[0]
managed_object_class = data.xpath('@class')[0]
managed_object_version = data.xpath('@version')[0]
managed_object_distname = data.xpath('@distname')[0]
managed_object_id = data.xpath('@id')[0]
# print() with a single argument behaves the same on Python 2 and Python 3.
print("Id: {}".format(managed_object_id))
print("Class: {}".format(managed_object_class))
print("Version: {}".format(managed_object_version))
print("DistName: {}".format(managed_object_distname))
Output:
Id: 111
Class: MRBTS
Version: XXX
DistName: PLMN-PLMN/MRBTS-XXX
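Answer 1 (score: 0)
I cannot reproduce the problem as described. Whether the first two lines (the XML declaration and the document type declaration) are present makes no difference: the element is not found either way.
What matters is that the XML document is in a namespace. The namespace name (raml20.xsd) is a bit unusual, but that is fine. The following prints the wanted element: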
from lxml import etree
xmlDoc = etree.parse("File.xml")
moNode = xmlDoc.find("r:cmData/r:managedObject", namespaces={"r": "raml20.xsd"})
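print(moNode)
The code above uses a prefix (r) that is mapped to the namespace name. An alternative is to use the namespace name directly, enclosed in curly braces: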
moNode = xmlDoc.find("{raml20.xsd}cmData/{raml20.xsd}managedObject")
Wildcards also work:
moNode = xmlDoc.find("{*}cmData/{*}managedObject")
In all three cases, the output is:
<Element {raml20.xsd}managedObject at 0x2787c60>
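For anyone who wants to check the namespace behaviour without a file on disk, here is a minimal, self-contained sketch. It makes some assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (the find() path syntax shown here is the same in both), XML_TEXT is a trimmed copy of the document from the question, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document, still including the
# XML declaration and the DOCTYPE line (assumption: header element omitted).
XML_TEXT = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
<raml version="2.0" xmlns="raml20.xsd">
<cmData type="actual">
<managedObject class="MRBTS" version="XXX" distName="PLMN-PLMN/MRBTS-XXX" id="111">
</managedObject>
</cmData>
</raml>"""

root = ET.fromstring(XML_TEXT)

# The root tag already shows the namespace in {namespace}name form.
root_tag = root.tag

# The prefix "r" is arbitrary; only the namespace name it maps to matters.
NS = {"r": "raml20.xsd"}

# Unqualified names do not match elements that live in a namespace.
unqualified = root.find("cmData/managedObject")

# Lookup with a prefix map.
by_prefix = root.find("r:cmData/r:managedObject", namespaces=NS)

# Lookup with the namespace name written out in curly braces.
by_braces = root.find("{raml20.xsd}cmData/{raml20.xsd}managedObject")

print(root_tag)
print(unqualified)
print(by_prefix.tag)
print(by_braces.get("distName"))
```

The unqualified lookup returns None even though the two leading lines are still in the input, while both namespace-aware lookups find the element. Printing root.tag is a quick way to spot a default namespace when a find() call unexpectedly returns None.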