I have an xml file that I need to open and make some changes to. One of those changes is to remove the namespace and prefix, then save the result to another file. Here is the xml:
<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer">
<provider>some data</provider>
<language>en-GB</language>
</package>
I can make the other changes I need, but can't figure out how to remove the namespace and prefix. This is the result xml I need:
<?xml version='1.0' encoding='UTF-8'?>
<package>
<provider>some data</provider>
<language>en-GB</language>
</package>
Here is my script, which opens and parses the xml and saves it:
from lxml import etree

metadata = '/Users/user1/Desktop/Python/metadata.xml'
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)  # etree.parse opens the file itself, no separate open() call is needed
root = tree.getroot()
tree.write('/Users/user1/Desktop/Python/done.xml',
           pretty_print=True, xml_declaration=True, encoding='UTF-8')
So how do I add code to my script to remove the namespace and prefix?
答案 0 (得分:26)
Replace the tags as Uku Loskit suggests. In addition to that, use lxml.objectify.deannotate.
from lxml import etree, objectify
metadata = '/Users/user1/Desktop/Python/metadata.xml'
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()
####
for elem in root.getiterator():
    if not hasattr(elem.tag, 'find'):  # (1)
        continue
    i = elem.tag.find('}')
    if i >= 0:
        elem.tag = elem.tag[i + 1:]
objectify.deannotate(root, cleanup_namespaces=True)
####
tree.write('/Users/user1/Desktop/Python/done.xml',
           pretty_print=True, xml_declaration=True, encoding='UTF-8')
Update: certain tags such as Comment return a function when the tag attribute is accessed, so a guard was added for this, marked (1) above.
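As a quick, throwaway illustration of why that guard matters (not part of the original answer): comment nodes expose a callable rather than a string as their tag, so they have no find method.

from lxml import etree

# Hypothetical mini-document just to show the guard in action
doc = etree.fromstring(b'<r xmlns="http://example.com/ns"><!-- a comment --><a/></r>')
for elem in doc.iter():
    # Elements print a namespaced string; the comment prints a callable and fails the hasattr test
    print(repr(elem.tag), hasattr(elem.tag, 'find'))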
答案 1 (得分:17)
>>> root.tag
'{http://latest/nmc-omc/cmNrm.doc#measCollec}measCollecFile'
>>> etree.QName(root.tag).localname
'measCollecFile'
Addendum: lxml.etree.QName also accepts an element on construction, so etree.QName(root.tag).localname is equivalent to:
etree.QName(root).localname
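For completeness, a minimal sketch (reusing the paths from the question) of how this QName idea could be applied end to end; cleanup_namespaces is used afterwards to drop the declarations that are no longer referenced:

from lxml import etree

tree = etree.parse('/Users/user1/Desktop/Python/metadata.xml')
for elem in tree.getroot().iter():
    if isinstance(elem.tag, str):              # skip comments / processing instructions
        elem.tag = etree.QName(elem).localname
etree.cleanup_namespaces(tree.getroot())       # drop the now-unused xmlns declarations
tree.write('/Users/user1/Desktop/Python/done.xml',
           pretty_print=True, xml_declaration=True, encoding='UTF-8')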
答案 2 (得分:4)
import xml.etree.ElementTree as ET

def remove_namespace(doc, namespace):
    """Remove namespace in the passed document in place."""
    ns = u'{%s}' % namespace
    nsl = len(ns)
    for elem in doc.iter():  # getiterator() was removed from ElementTree in Python 3.9
        if elem.tag.startswith(ns):
            elem.tag = elem.tag[nsl:]

metadata = '/Users/user1/Desktop/Python/metadata.xml'
tree = ET.parse(metadata)
root = tree.getroot()
remove_namespace(root, u'http://apple.com/itunes/importer')
# Note: ElementTree's write() has no pretty_print option; that keyword is lxml-only
tree.write('/Users/user1/Desktop/Python/done.xml',
           xml_declaration=True, encoding='UTF-8')
This uses a code snippet taken from here. The method can easily be extended to strip namespace attributes as well, by searching for attribute names beginning with "xmlns"; a rough sketch of that extension follows below.
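A hedged sketch of that extension (the helper name remove_namespace_attrs is illustrative, not from the original answer). Note that with xml.etree.ElementTree the xmlns declarations themselves never show up in .attrib, so what this actually cleans are attribute keys carrying the {uri} prefix:

import xml.etree.ElementTree as ET

def remove_namespace_attrs(doc, namespace):
    """Strip the given namespace from attribute keys, in place (illustrative helper)."""
    ns = u'{%s}' % namespace
    for elem in doc.iter():
        for key in list(elem.attrib):      # copy the keys, since we mutate the dict below
            if key.startswith(ns):
                elem.attrib[key[len(ns):]] = elem.attrib.pop(key)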
答案 3 (得分:1)
You can also strip the namespaces using XSLT...
XSLT 1.0 (test.xsl)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="*" priority="1">
    <xsl:element name="{local-name()}" namespace="">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}" namespace="">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>

</xsl:stylesheet>
Python
from lxml import etree
tree = etree.parse("metadata.xml")
xslt = etree.parse("test.xsl")
new_tree = tree.xslt(xslt)
print(etree.tostring(new_tree, pretty_print=True, xml_declaration=True,
                     encoding="UTF-8").decode("UTF-8"))
Output
<?xml version='1.0' encoding='UTF-8'?>
<package>
<provider>some data</provider>
<language>en-GB</language>
</package>
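If, like the asker, you also want the transformed document written back to disk, the XSLT result tree supports the usual write call (the output path below is simply reused from the question):

# Persist the namespace-free result (path reused from the question)
new_tree.write('/Users/user1/Desktop/Python/done.xml',
               pretty_print=True, xml_declaration=True, encoding='UTF-8')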
答案 4 (得分:1)
You can try it with lxml:
# Remove namespace prefixes by rewriting each tag to its local name
for elem in root.getiterator():
    elem.tag = elem.xpath('local-name()')
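On its own this leaves the original xmlns declarations in place, now unused; since lxml is already in play, they can be dropped afterwards, roughly like so:

from lxml import etree

# Remove namespace declarations that are no longer referenced by any tag or attribute
etree.cleanup_namespaces(root)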
答案 5 (得分:0)
All you need to do is:
objectify.deannotate(root, cleanup_namespaces=True)
where root = tree.getroot()
答案 6 (得分:0)
Here are two more ways to remove namespaces. The first uses the lxml.etree.QName helper, the second uses regular expressions. Both functions accept an optional list of namespaces to match against; if no namespace list is supplied, all namespaces are removed. Attribute keys are cleaned up as well.
from lxml import etree
import re

def remove_namespaces_qname(doc, namespaces=None):
    for el in doc.getiterator():
        # clean tag
        q = etree.QName(el.tag)
        if q is not None:
            if namespaces is not None:
                if q.namespace in namespaces:
                    el.tag = q.localname
            else:
                el.tag = q.localname
        # clean attributes
        for a, v in el.items():
            q = etree.QName(a)
            if q is not None:
                if namespaces is not None:
                    if q.namespace in namespaces:
                        del el.attrib[a]
                        el.attrib[q.localname] = v
                else:
                    del el.attrib[a]
                    el.attrib[q.localname] = v
    return doc

def remove_namespace_re(doc, namespaces=None):
    if namespaces is not None:
        ns = list(map(lambda n: u'{%s}' % n, namespaces))
    for el in doc.getiterator():
        # clean tag
        m = re.match(r'({.+})(.+)', el.tag)
        if m is not None:
            if namespaces is not None:
                if m.group(1) in ns:
                    el.tag = m.group(2)
            else:
                el.tag = m.group(2)
        # clean attributes
        for a, v in el.items():
            m = re.match(r'({.+})(.+)', a)
            if m is not None:
                if namespaces is not None:
                    if m.group(1) in ns:
                        del el.attrib[a]
                        el.attrib[m.group(2)] = v
                else:
                    del el.attrib[a]
                    el.attrib[m.group(2)] = v
    return doc
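A minimal usage sketch against the asker's file (paths reused from the question; either helper can be dropped in), assuming a document without comments or processing instructions:

tree = etree.parse('/Users/user1/Desktop/Python/metadata.xml')
remove_namespaces_qname(tree.getroot())        # or: remove_namespace_re(tree.getroot())
tree.write('/Users/user1/Desktop/Python/done.xml',
           pretty_print=True, xml_declaration=True, encoding='UTF-8')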
答案 7 (得分:0)
So I realize this is an older question with a highly upvoted and accepted answer, but if you are reading large files and find yourself in the same predicament I was in, I hope this helps.
The problem with this approach is really the iteration. No matter how fast the parser is, doing anything, say, a few hundred thousand times is going to eat your execution time. That said, for me it came down to really thinking the problem through and understanding how namespaces work (or are "intended to work", because honestly they were not needed here). Now, if your xml truly uses namespaces, meaning you see tags that look like this: <xs:table>, you will need to tweak the approach here for your use case. I will include the full way of handling things as well.
The problem: namespace stripping takes forever... and most of the time the namespaces only live inside the very first tag, our "root". So, thinking about how python reads the information in, and that our only problem child is that root node, why not use that to our advantage.
Please note: the file I am using as an example is raw, horrid and fairly nonsensical in structure, with the promise of data in there somewhere.
my_file is the path to the file used in this example, which I cannot share with you for professional reasons; it has been scaled way down in size to get through this answer.
import os, sys, subprocess, re, io, json
from lxml import etree

# Your file would be '_biggest_file' if playing along at home
my_file = _biggest_file

meta_stuff = dict(
    exists = os.path.exists(_biggest_file),
    sizeof = os.path.getsize(_biggest_file),
    extension_is_a_real_thing = any(re.findall(r"\.(html|xml)$", my_file, re.I)),
    system_thinks_its_a = subprocess.check_output(
        ["file", "-i", _biggest_file]
    ).decode().split(":")[-1:][0].strip()
)

print(json.dumps(meta_stuff, indent = 2))
So for starters, the size is decent and the system thinks it is, at best, html; the file extension is neither xml nor html either...
{
  "exists": true,
  "sizeof": 24442371,
  "extension_is_a_real_thing": false,
  "system_thinks_its_a": "text/html; charset=us-ascii"
}
The method:
def speed_read(file_path):
    # We're gonna be low-brow and add our own using this string. It's fine
    _xml_dec = '<?xml version="1.0" encoding="utf-8"?>'

    # Even worse.. rgx for xml here we go
    #
    # We'll need to extract the very first node that we find in our document,
    # because for our purposes thats the one we know has the namespace uri's
    # ie: "attributes"
    # FiRsT node : <actual_name xmlns:xsi="idontactuallydoanything.com">
    # We're going to pluck out that first node, get the tags actual name
    # which means from:
    # <actual_name xmlns:xsi="idontactuallydoanything.com">...</actual_name>
    # We pluck:
    # actual_name
    # Then we're gonna replace the entire tag with one we make from that name
    # by simple string substitution
    #
    # -> 'starting from the beginning, capture everything between the < and the >'
    _first_node = re.compile(r'^(\<.*?\>)', re.I | re.M | re.U)
    # -> 'Starting from the beginning, but dont you get me the <, find anything that happens
    #     before the first white-space, which i don't want either man'
    _first_tagname = re.compile(r'(?<=^\<)(.*?)\S+', re.I | re.M | re.U)

    # open the file context
    with open(file_path, "r", encoding="utf-8") as f:
        # go ahead and strip leading and trailing, cause why not... plus adds
        # safety for our regex's
        _raw = f.read().strip()

    # Now, if the file somehow happens to magically have the xml declaration, we
    # wanna go ahead and remove it as we plan to add our own. But for efficiency,
    # only check the first couple of characters
    if _raw.startswith('<?xml', 0, 5):
        # _raw = re.sub(_xml_dec, '', _raw).strip()
        _raw = re.sub(r'\<\?xml.*?\?>\n?', '', _raw).strip()

    # Here we grab that first node that has those meaningless namespaces
    root_element = _first_node.search(_raw).group()
    # here we get its name
    first_tag = _first_tagname.search(root_element).group()
    # Here, we substitute the entire element with a new one
    # that only contains the element's name
    # (re.escape keeps any regex metacharacters in the tag from being interpreted)
    _raw = re.sub(re.escape(root_element), '<{}>'.format(first_tag), _raw)
    # Now we add our declaration tag in the worst way you have ever
    # seen, but I miss sprintf, so this is how i'm rolling. Python is terrible btw
    _raw = "{}{}".format(_xml_dec, _raw)
    # The bytes part here might end up being overkill.. but this has worked
    # for me consistently so it stays.
    return etree.parse(io.BytesIO(bytes(bytearray(_raw, encoding="utf-8"))))

# a good answer from above:
def safe_read(file_path):
    root = etree.parse(file_path)
    for elem in root.getiterator():
        elem.tag = etree.QName(elem).localname
    # Remove unused namespace declarations
    etree.cleanup_namespaces(root)
    return root
import time  # needed for the timing below, missing from the imports above
import pandas as pd

safe_times = []
for i in range(0, 5):
    s = time.time()
    safe_read(_biggest_file)
    safe_times.append(time.time() - s)

fast_times = []
for i in range(0, 5):
    s = time.time()
    speed_read(_biggest_file)
    fast_times.append(time.time() - s)

pd.DataFrame({"safe": safe_times, "fast": fast_times})
| safe | fast |
|------|------|
| 2.36 | 0.61 |
| 2.15 | 0.58 |
| 2.47 | 0.49 |
| 2.94 | 0.60 |
| 2.83 | 0.53 |
答案 8 (得分:0)
Define and call the following function right after parsing the XML string:
from lxml import etree

def clean_xml_namespaces(root):
    for element in root.getiterator():
        if isinstance(element, etree._Comment):
            continue
        element.tag = etree.QName(element).localname
    etree.cleanup_namespaces(root)
> Note: comment elements in the XML are skipped, as they should be.
Usage:
xml_content = b'''<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <dependencies>
        <dependency>
            <groupId>org.easytesting</groupId>
            <artifactId>fest-assert</artifactId>
            <version>1.4</version>
        </dependency>
        <!-- this dependency is critical -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.4</version>
        </dependency>
    </dependencies>
</project>
'''
root = etree.fromstring(xml_content)
clean_xml_namespaces(root)
elements = root.findall(".//dependency")
print(len(elements))
# outputs "2", as expected