Question

For example, consider parsing a pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

    <parent>
        <groupId>com.parent</groupId>
        <artifactId>parent</artifactId>
        <version>1.0-SNAPSHOT</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <modelVersion>2.0.0</modelVersion>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>Some Module</name>
    ...

Code:

import xml.etree.ElementTree as ET

tree = ET.parse(pom)
root = tree.getroot()

groupId = root.find("groupId")
artifactId = root.find("artifactId")

Both groupId and artifactId are None. Why, given that they are direct descendants of the root? I tried replacing root with tree (groupId = tree.find("groupId")), but that changed nothing.

Answer 1

The problem is that you don't have a child named groupId; you have a child named {http://maven.apache.org/POM/4.0.0}groupId. etree does not ignore XML namespaces; it uses "universal names". See "Working with Namespaces and Qualified Names" in the effbot docs.
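As a minimal sketch of what that means in practice, the snippet below inlines a trimmed-down copy of the pom.xml from the question (so it runs on its own) and looks the elements up by their namespace-qualified names. The prefix "m" and the variable names are arbitrary choices made for this illustration, not anything ElementTree requires.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the pom.xml in the question, inlined for the example.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
</project>"""

root = ET.fromstring(POM)

# The unqualified lookup fails because every tag carries the default namespace.
unqualified = root.find("groupId")

# Option 1: spell out the universal name, {namespace-uri}localname.
group_id = root.find("{http://maven.apache.org/POM/4.0.0}groupId")

# Option 2: pass a prefix-to-URI map; "m" is a prefix picked for this sketch.
ns = {"m": "http://maven.apache.org/POM/4.0.0"}
artifact_id = root.find("m:artifactId", ns)

print(root.tag)          # the root tag is namespace-qualified too
print(unqualified)       # None
print(group_id.text, artifact_id.text)
```

The prefix map is usually the more readable choice once there are several lookups, since the namespace URI is written only once.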

Answer 2

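Just to expand on abarnert's comment about BeautifulSoup: if you only want a quick and dirty solution to the problem, this is probably the fastest way to go. I have implemented this (for a personal script) using bs4, where you can traverse the tree with

element = dom.getElementsByTagNameNS('*','elementname')

This looks the element up in the dom under any namespace, which is handy if you know the file contains only one namespace, so there is no ambiguity.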
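For reference, getElementsByTagNameNS is the DOM method that the standard library's xml.dom.minidom exposes, so the sketch below assumes minidom rather than bs4. It uses a trimmed-down copy of the pom.xml from the question, inlined so it runs on its own, and shows one thing to watch for: the search covers all descendants, so the groupId inside parent is returned as well.

```python
from xml.dom import minidom

# Trimmed-down stand-in for the pom.xml in the question, inlined for the example.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
    <parent>
        <groupId>com.parent</groupId>
    </parent>
    <groupId>com.parent.somemodule</groupId>
</project>"""

dom = minidom.parseString(POM)

# "*" matches any namespace URI. The search is recursive, so nested
# elements with the same local name are returned too, in document order.
matches = dom.getElementsByTagNameNS("*", "groupId")
all_ids = [node.firstChild.data for node in matches]

# Keep only the elements that are direct children of <project>.
own_ids = [
    node.firstChild.data
    for node in matches
    if node.parentNode is dom.documentElement
]

print(all_ids)
print(own_ids)
```

Filtering on parentNode is one way to recover the "direct child" behaviour that root.find("groupId") was expected to have in the question.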

Simple dom traversal in Python using xml.etree.ElementTree
