Remove namespace and prefix from XML in Python using lxml

Time: 2013-08-10 06:17:56

Tags: python xml namespaces lxml

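I have an XML file that I need to open and make some changes to. One of those changes is removing the namespace and prefix, then saving the result to another file. Here is the XML: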

<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer">
  <provider>some data</provider>
  <language>en-GB</language>
</package>

I can make the other changes I need, but I can't find out how to remove the namespace and prefix. This is the resulting XML I need:

<?xml version='1.0' encoding='UTF-8'?>
<package>
  <provider>some data</provider>
  <language>en-GB</language>
</package>

Here is my script, which opens and parses the XML and saves it:

from lxml import etree

metadata = '/Users/user1/Desktop/Python/metadata.xml'
parser = etree.XMLParser(remove_blank_text=True)
# etree.parse() opens the file itself, so no separate open() call is needed
tree = etree.parse(metadata, parser)
root = tree.getroot()
tree.write('/Users/user1/Desktop/Python/done.xml',
           pretty_print=True, xml_declaration=True, encoding='UTF-8')

So how would I add code to my script to remove the namespace and prefix?

9 answers:

Answer 0 (score: 26)

Replace the tags as Uku Loskit suggests. In addition to that, use lxml.objectify.deannotate.

from lxml import etree, objectify

metadata = '/Users/user1/Desktop/Python/metadata.xml'
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()

####    
for elem in root.iter():  # getiterator() is deprecated in favour of iter()
    if not hasattr(elem.tag, 'find'): continue  # (1)
    i = elem.tag.find('}')
    if i >= 0:
        elem.tag = elem.tag[i+1:]
objectify.deannotate(root, cleanup_namespaces=True)
####

tree.write('/Users/user1/Desktop/Python/done.xml',
           pretty_print=True, xml_declaration=True, encoding='UTF-8')

UPDATE

Some nodes, such as Comment, return a function when their tag attribute is accessed. Added a guard for that. (1)
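
To see why the guard is needed, here is a minimal sketch showing that an element's tag is a string while a comment node's tag is a callable. The sample XML and the variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical sample document containing a comment node
xml = (b'<package xmlns="http://apple.com/itunes/importer">'
       b'<!-- note --><provider>some data</provider></package>')
root = etree.fromstring(xml)

# For every node in document order, record whether .tag is a string
tag_is_string = [isinstance(node.tag, str) for node in root.iter()]

# The first child is the comment; its .tag is a function, not a string
comment_tag_is_callable = callable(root[0].tag)

# The guard from the answer: functions have no 'find' attribute
guard_skips_comment = not hasattr(root[0].tag, 'find')
```

Checking isinstance(elem.tag, str) is an alternative way to write the same guard.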

Answer 1 (score: 17)

>>> root.tag
'{http://latest/nmc-omc/cmNrm.doc#measCollec}measCollecFile'
>>> etree.QName(root.tag).localname
'measCollecFile'

source

Addendum: lxml.etree.QName also accepts an element in its constructor. So etree.QName(root.tag).localname is equivalent to:

etree.QName(root).localname
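
Applied to a whole tree, this might look like the following minimal sketch. The function name strip_namespaces is hypothetical, and the sample XML is the one from the question.

```python
from lxml import etree

def strip_namespaces(root):
    """Replace every element tag with its local name, in place."""
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # comments and processing instructions are skipped
        elem.tag = etree.QName(elem).localname
    # Drop namespace declarations that are no longer used
    etree.cleanup_namespaces(root)
    return root

xml = (b'<package xmlns="http://apple.com/itunes/importer">'
       b'<provider>some data</provider><language>en-GB</language></package>')
root = strip_namespaces(etree.fromstring(xml))
result = etree.tostring(root).decode()
```

After the call, plain lookups such as root.find('provider') work without a namespace map.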

Answer 2 (score: 4)

import xml.etree.ElementTree as ET
def remove_namespace(doc, namespace):
    """Remove namespace in the passed document in place."""
    ns = u'{%s}' % namespace
    nsl = len(ns)
    for elem in doc.iter():  # getiterator() was removed from ElementTree in Python 3.9
        if elem.tag.startswith(ns):
            elem.tag = elem.tag[nsl:]

metadata = '/Users/user1/Desktop/Python/metadata.xml'
tree = ET.parse(metadata)
root = tree.getroot()

remove_namespace(root, u'http://apple.com/itunes/importer')
# ElementTree's write() has no pretty_print option (that one is lxml-only)
tree.write('/Users/user1/Desktop/Python/done.xml',
           xml_declaration=True, encoding='UTF-8')

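Used part of the code snippet from here. This method can easily be extended to remove any namespace attributes by searching for tags that start with "xmlns".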
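
One possible generalisation is sketched below: it strips every namespace from both tags and attribute names using only the standard library. The function name strip_all_namespaces is hypothetical. Note that ElementTree does not expose the xmlns declarations as attributes after parsing; a namespace only shows up as a {uri} prefix on tag and attribute names, so that prefix is what the sketch removes.

```python
import xml.etree.ElementTree as ET

def strip_all_namespaces(root):
    """Remove the '{uri}' prefix from every tag and attribute name, in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith('{'):
            elem.tag = elem.tag.split('}', 1)[1]
        # Re-key namespaced attributes; list() lets us modify attrib while looping
        for key in list(elem.attrib):
            if key.startswith('{'):
                elem.attrib[key.split('}', 1)[1]] = elem.attrib.pop(key)
    return root

xml = ('<package xmlns="http://apple.com/itunes/importer">'
       '<provider>some data</provider></package>')
root = strip_all_namespaces(ET.fromstring(xml))
result = ET.tostring(root, encoding='unicode')
```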

Answer 3 (score: 1)

You can also use XSLT to strip namespaces...

XSLT 1.0 (test.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="*" priority="1">
    <xsl:element name="{local-name()}" namespace="">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}" namespace="">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>

</xsl:stylesheet>

Python

from lxml import etree

tree = etree.parse("metadata.xml")
xslt = etree.parse("test.xsl")

new_tree = tree.xslt(xslt)

print(etree.tostring(new_tree, pretty_print=True, xml_declaration=True, 
                     encoding="UTF-8").decode("UTF-8"))

Output

<?xml version='1.0' encoding='UTF-8'?>
<package>
  <provider>some data</provider>
  <language>en-GB</language>
</package>
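
If keeping a separate test.xsl file is inconvenient, the same stylesheet can be compiled from a string with etree.XSLT. The following is a minimal sketch under that assumption; the constant name STRIP_NS_XSLT and the inline sample document are illustrative only.

```python
from lxml import etree

# Same templates as test.xsl above, without the output/strip-space settings
STRIP_NS_XSLT = b'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="*" priority="1">
    <xsl:element name="{local-name()}" namespace="">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}" namespace="">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>'''

# Compile the stylesheet once; the resulting object is callable
transform = etree.XSLT(etree.fromstring(STRIP_NS_XSLT))

doc = etree.fromstring(
    b'<package xmlns="http://apple.com/itunes/importer">'
    b'<provider>some data</provider></package>')
new_root = transform(doc).getroot()
```

Unlike the in-place approaches, this builds a new tree and leaves the original document untouched.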

Answer 4 (score: 1)

You can try this with lxml:

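# Remove namespace prefixes
for elem in root.iter():
    if not isinstance(elem.tag, str):
        continue  # skip comments and processing instructions
    # local-name() gives the tag without its namespace; assign it back
    elem.tag = elem.xpath('local-name()')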

Answer 5 (score: 0)

All you need to do is:

objectify.deannotate(root, cleanup_namespaces=True)

after getting the root with root = tree.getroot().

Answer 6 (score: 0)

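Here are two more ways to remove namespaces. The first uses the lxml.etree.QName helper, while the second uses regular expressions. Both functions accept an optional list of namespaces to match. If no namespace list is provided, all namespaces are removed. Attribute keys are cleaned as well.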

from lxml import etree
import re

def remove_namespaces_qname(doc, namespaces=None):

    for el in doc.iter(tag=etree.Element):  # elements only; comments have no QName

        # clean tag
        q = etree.QName(el.tag)
        if q is not None:
            if namespaces is not None:
                if q.namespace in namespaces:
                    el.tag = q.localname
            else:
                el.tag = q.localname

            # clean attributes
            for a, v in el.items():
                q = etree.QName(a)
                if q is not None:
                    if namespaces is not None:
                        if q.namespace in namespaces:
                            del el.attrib[a]
                            el.attrib[q.localname] = v
                    else:
                        del el.attrib[a]
                        el.attrib[q.localname] = v
    return doc


def remove_namespace_re(doc, namespaces=None):

    if namespaces is not None:
        ns = list(map(lambda n: u'{%s}' % n, namespaces))

    for el in doc.iter(tag=etree.Element):  # elements only; a comment's tag is not a string

        # clean tag
        m = re.match(r'({.+})(.+)', el.tag)
        if m is not None:
            if namespaces is not None:
                if m.group(1) in ns:
                    el.tag = m.group(2)
            else:
                el.tag = m.group(2)

            # clean attributes
            for a, v in el.items():
                m = re.match(r'({.+})(.+)', a)
                if m is not None:
                    if namespaces is not None:
                        if m.group(1) in ns:
                            del el.attrib[a]
                            el.attrib[m.group(2)] = v
                    else:
                        del el.attrib[a]
                        el.attrib[m.group(2)] = v
    return doc

Answer 7 (score: 0)

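So I realize this is an older question with a highly voted, accepted answer, but if you are reading large files and find yourself in the same predicament I was in, I hope this helps you out.

The problem with that approach is really the iteration. No matter how fast the parser is, doing anything, say, a few hundred thousand times is going to eat up your execution time. With that said, for me it came down to really thinking about the problem and understanding how namespaces work (or are "meant to work", because honestly they aren't needed here). Now, if your xml truly uses namespaces, meaning you see tags that look like this: <xs:table>, then you will need to adjust the approach here for your use case. I'll include the full way of handling it as well.

Disclaimer: I cannot, in good conscience, tell you to use regex when parsing html/xml. Go look at SergiyKolesnikov's answer, because it works. But I had an edge case, so with that said... let's dive into some regex!

Problem: namespace stripping takes forever... and most of the time the namespaces only live inside the very first tag, our "root". So, thinking about how python reads the data, and given that our only problem child is the root node, why not use that to our advantage.

Please note: the file I'm using as an example is a raw, horrible, very pointless lulz of a structure, with the promise of data somewhere in there.

my_file is the path to the file used in our example. I can't share it with you for professional reasons, and it has been cut down in size for the purposes of this answer.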

import os, sys, subprocess, re, io, json
from lxml import etree

# Your file would be '_biggest_file' if playing along at home
my_file = _biggest_file
meta_stuff = dict(
    exists = os.path.exists(_biggest_file), 
    sizeof = os.path.getsize(_biggest_file),
    extension_is_a_real_thing = any(re.findall(r"\.(html|xml)$", my_file, re.I)),
    system_thinks_its_a = subprocess.check_output(
        ["file", "-i", _biggest_file]
    ).decode().split(":")[-1:][0].strip()
)


print(json.dumps(meta_stuff, indent = 2))

So for starters: it's decently sized, the system thinks it's html at best, and the file extension is neither xml nor html...


{
  "exists": true,
  "sizeof": 24442371,
  "extension_is_a_real_thing": false,
  "system_thinks_its_a": "text/html; charset=us-ascii"
}

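Approach:

  1. In order to parse an xml file... it should at least be xml, so we need to check for a declaration tag and add one if it's missing
  2. If I have namespaces... that's bad, because I can't use xpaths, which is exactly what I want to do
  3. If my file is huge, I should only operate on the smallest conceivable part that needs cleaning before I'm ready to parse it.

Functions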


def speed_read(file_path):

    # We're gonna be low-brow and add our own using this string. It's fine
    _xml_dec = '<?xml version="1.0" encoding="utf-8"?>'
    # Even worse.. rgx for xml here we go
    # 
    # We'll need to extract the very first node that we find in our document, 
    # because for our purposes thats the one we know has the namespace uri's 
    # ie: "attributes"
    #    FiRsT node : <actual_name xmlns:xsi="idontactuallydoanything.com">
    # We're going to pluck out that first node, get the tags actual name
    # which means from:
    #    <actual_name xmlns:xsi="idontactuallydoanything.com">...</actual_name>
    # We pluck:
    #    actual_name
    # Then we're gonna replace the entire tag with one we make from that name
    # by simple string substitution
    # 
    # -> 'starting from the beginning, capture everything between the < and the >'
    _first_node = re.compile(r'^(<.*?>)', re.I|re.M|re.U)
    # -> 'Starting from the beginning, but dont you get me the <, find anything that happens
    #     before the first white-space, which i don't want either man'
    _first_tagname = re.compile(r'(?<=^<)[^\s/>]+', re.I|re.M|re.U)
    # open the file context
    with open(file_path, "r", encoding = "utf-8") as f:
        # go ahead and strip leading and trailing, cause why not... plus adds 
        # safety for our regex's
        _raw = f.read().strip()
        # Now, if the file somehow happens to magically have the xml declaration, we
        # wanna go ahead and remove it as we plan to add our own. But for efficiency, 
        # only check the first couple of characters
        if _raw.startswith('<?xml', 0, 5):
            #_raw = re.sub(_xml_dec, '', _raw).strip()
            _raw = re.sub(r'<\?xml.*?\?>\n?', '', _raw, count=1).strip()
        # Here we grab that first node that has those meaningless namespaces
        root_element = _first_node.search(_raw).group()
        # here we get its name
        first_tag = _first_tagname.search(root_element).group()
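        # Here, we substitute the entire element with a new one that only
        # contains the element's name (plain string replace, first hit only,
        # so regex metacharacters in the namespace URIs can't bite us)
        _raw = _raw.replace(root_element, '<{}>'.format(first_tag), 1)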
        # Now we add our declaration tag in the worst way you have ever
        # seen, but I miss sprintf, so this is how i'm rolling. Python is terrible btw
        _raw = "{}{}".format(_xml_dec, _raw)
        # The bytes part here might end up being overkill.. but this has worked 
        # for me consistently so it stays. 
        return etree.parse(io.BytesIO(bytes(bytearray(_raw, encoding = "utf-8"))))



# a good answer from above:

def safe_read(file_path):
    root = etree.parse(file_path)
    for elem in root.iter(tag=etree.Element):  # elements only; comments have no QName
        elem.tag = etree.QName(elem).localname
    # Remove unused namespace declarations
    etree.cleanup_namespaces(root)
    return root

Benchmark - yes, I know there are better ways to do this.

import time
import pandas as pd

safe_times = []
for i in range(0,5):
    s = time.time()
    safe_read(_biggest_file)
    safe_times.append(time.time() - s)


fast_times = []
for i in range(0,5):
    s = time.time()
    speed_read(_biggest_file)
    fast_times.append(time.time() - s)


pd.DataFrame({"safe":safe_times, "fast":fast_times})
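
One of those better ways is the standard-library timeit module, which handles the repetition for you. The following is a minimal sketch; time_reader is a hypothetical helper, and dummy_reader only stands in for speed_read / safe_read so the sketch runs on its own.

```python
import timeit

def time_reader(reader, path, repeat=5):
    """Call reader(path) `repeat` times; return each run's duration in seconds."""
    return timeit.repeat(lambda: reader(path), number=1, repeat=repeat)

# Stand-in reader (an assumption for illustration, not one of the functions above)
def dummy_reader(path):
    return len(path)

times = time_reader(dummy_reader, "metadata.xml", repeat=3)
```

In the benchmark above this would be called as time_reader(safe_read, _biggest_file) and time_reader(speed_read, _biggest_file).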

Results


safe (s)  fast (s)
2.36      0.61
2.15      0.58
2.47      0.49
2.94      0.60
2.83      0.53

Answer 8 (score: 0)

Define and call the following function right after parsing the XML string:

from lxml import etree

def clean_xml_namespaces(root):
    for element in root.iter():  # getiterator() is deprecated in favour of iter()
        if isinstance(element, etree._Comment):
            continue
        element.tag = etree.QName(element).localname
    etree.cleanup_namespaces(root)

Note - comment elements in the XML are skipped, as they should be.

Usage:

xml_content = b'''<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <dependencies>
        <dependency>
            <groupId>org.easytesting</groupId>
            <artifactId>fest-assert</artifactId>
            <version>1.4</version>
        </dependency>

        <!-- this dependency is critical -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.4</version>
        </dependency>
    </dependencies>
</project>
'''

root = etree.fromstring(xml_content)
clean_xml_namespaces(root) 
elements = root.findall(".//dependency")
print(len(elements)) 
# outputs "2", as expected