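I am trying to open an XML file and get the values of certain tags. I have done this many times, but this particular XML file is giving me trouble. Here is part of the XML file: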
<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
<provider>filmgroup</provider>
<language>en-GB</language>
<actor name="John Smith" display="Doe John"></actor>
</package>
Here is a sample of my Python code:
metadata = '/Users/mylaptop/Desktop/Python/metadata.xml'
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
open(metadata)
tree = etree.parse(metadata, parser)
root = tree.getroot()
for element in root.iter(tag='provider'):
    providerValue = tree.find('//provider')
    providerValue = providerValue.text
    print providerValue
tree.write('/Users/mylaptop/Desktop/Python/metadataDone.xml', pretty_print = True, xml_declaration = True, encoding = 'UTF-8')
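When I run it, neither the provider tag nor its value is found. If I remove xmlns="http://apple.com/itunes/importer", everything works as expected.
My question is: how can I remove this namespace, which I am not interested in, so that I can get the tag values I need with lxml?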
Answer 0: (score: 10)
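The provider tag is in the http://apple.com/itunes/importer namespace, so you need either to use the fully qualified name {http://apple.com/itunes/importer}provider or to use one of the lxml methods that accept the namespaces parameter, such as root.xpath. With the latter you can refer to the tag through a namespace prefix (e.g. ns:provider):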
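from lxml import etree

# `metadata` is the path to the XML file, as defined in the question
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()

# Map a prefix of our own choosing to the document's default namespace
namespaces = {'ns': 'http://apple.com/itunes/importer'}
# The union returns its matches in document order
items = iter(root.xpath('//ns:provider/text()|//ns:actor/@name',
                        namespaces=namespaces))
# Zipping one iterator with itself pairs up consecutive results
for provider, actor in zip(*[items]*2):
    print(provider, actor)
Output (Python 3; under Python 2 the same print shows the tuple ('filmgroup', 'John Smith')):
filmgroup John Smith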
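Note that the XPath used above assumes that the <provider> and <actor> elements always appear in alternation. If that is not the case, there are of course ways to handle it, but the code becomes a bit more verbose: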
for package in root.xpath('//ns:package', namespaces=namespaces):
    # Relative paths are evaluated against the current package element
    for provider in package.xpath('ns:provider', namespaces=namespaces):
        providerValue = provider.text
        print(providerValue)
    for actor in package.xpath('ns:actor', namespaces=namespaces):
        print(actor.attrib['name'])
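The other option mentioned at the top of this answer, the fully qualified name, might look roughly like the sketch below. It inlines the sample document from the question so that it runs on its own. The names XML, NS, provider_value and actor_names are illustrative only, and the fallback import is an assumption added so the snippet also runs where lxml is not installed (the standard library's ElementTree accepts the same calls used here).

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib, same API for these calls
    import xml.etree.ElementTree as etree

# Inline copy of the sample document from the question
XML = b"""<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
  <provider>filmgroup</provider>
  <language>en-GB</language>
  <actor name="John Smith" display="Doe John"></actor>
</package>"""

NS = 'http://apple.com/itunes/importer'

root = etree.fromstring(XML)

# A bare tag name does not match, because the element's real tag
# includes the namespace URI.
assert root.find('provider') is None

# "{namespace-uri}localname" is the tag as the parser stores it
provider = root.find('{%s}provider' % NS)
provider_value = provider.text

# iter() accepts the same fully qualified form
actor_names = [actor.get('name') for actor in root.iter('{%s}actor' % NS)]

print(provider_value, actor_names)
print(provider.tag)
```

The last line prints the tag exactly as stored, which is a quick way to check which namespace an element is really in when a lookup unexpectedly returns nothing.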
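Answer 1: (score: 1)
My suggestion is not to ignore the namespace but to take it into account. I wrote some related functions for my work on the django-quickbooks library (copied here with slight modifications). With these functions, you should be able to do this: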
providers = getels(root, 'provider', ns='http://apple.com/itunes/importer')
Here are the functions:
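class TagNotFound(Exception):
    """Raised when an expected tag is missing.

    Not part of the original snippet; defined here (an assumption) so
    that the functions below are self-contained.
    """

def get_tag_with_ns(tag_name, ns):
    # Build the fully qualified name: {namespace}tag
    return '{%s}%s' % (ns, tag_name)

def getel(elt, tag_name, ns=None):
    """ Gets the first tag that matches the specified tag_name taking into
    account the QB namespace.

    :param ns: The namespace to use if not using the default one for
        django-quickbooks.
    :type ns: string
    """
    res = elt.find(get_tag_with_ns(tag_name, ns=ns))
    if res is None:
        raise TagNotFound('Could not find tag by name "%s"' % tag_name)
    return res

def getels(elt, *path, **kwargs):
    """ Gets the first set of elements found at the specified path.

    Example:
        >>> xml = (
        ...     '<root xmlns="correct/namespace">' +
        ...     "<item>" +
        ...     "<id>1</id>" +
        ...     "</item>" +
        ...     "<item>" +
        ...     "<id>2</id>" +
        ...     "</item>" +
        ...     "</root>")
        >>> el = etree.fromstring(xml)
        >>> getels(el, 'item', ns='correct/namespace')
        [<Element {correct/namespace}item at 0x...>, <Element {correct/namespace}item at 0x...>]
    """
    # ns must be passed as a keyword argument
    ns = kwargs['ns']
    # i stays -1 when path has a single entry, so path[i+1] is path[0]
    i = -1
    # Descend through every path entry except the last, taking the first match
    for i in range(len(path)-1):
        elt = getel(elt, path[i], ns=ns)
    tag_name = path[i+1]
    return elt.findall(get_tag_with_ns(tag_name, ns=ns))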
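For comparison, the step-by-step descent in getels can also be written as a single path expression with a prefix map, which findall and findtext accept through their namespaces argument. This is only a sketch. The document, the urn:example:items namespace and the names XML, NSMAP, items and ids are hypothetical, and the fallback import is an assumption added so it runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: the stdlib ElementTree supports the same calls
    import xml.etree.ElementTree as etree

# Hypothetical document and namespace URI, for illustration only
XML = b"""<root xmlns="urn:example:items">
  <group>
    <item><id>1</id></item>
    <item><id>2</id></item>
  </group>
</root>"""

# The prefix is our own choice; it only has to map to the right URI
NSMAP = {'ns': 'urn:example:items'}

root = etree.fromstring(XML)

# One path expression instead of one lookup per path step
items = root.findall('ns:group/ns:item', namespaces=NSMAP)

# findtext returns the text of the first matching child
ids = [item.findtext('ns:id', namespaces=NSMAP) for item in items]

print(ids)
```

One difference to keep in mind: getels follows only the first match at each intermediate step, whereas the path expression collects items under every matching group.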