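I am trying to write a parsing algorithm that efficiently extracts data from an XML document. I currently walk the document element by element and child by child, but I would like to use iterparse instead. One problem is that I have a list of elements whose child data I want to extract whenever I encounter them, but with iterparse my options seem to be either filtering on a single element name or getting every element.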
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<data_object xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <source id="0">
        <name>Office Issues</name>
        <datetime>2012-01-13T16:09:15</datetime>
        <data_id>7</data_id>
    </source>
    <event id="125">
        <date>2012-11-06</date>
        <state_id>7</state_id>
    </event>
    <state id="7">
        <name>Washington</name>
    </state>
    <locality id="2">
        <name>Olympia</name>
        <state_id>7</state_id>
        <type>City</type>
    </locality>
    <locality id="3">
        <name>Town</name>
        <state_id>7</state_id>
        <type>Town</type>
    </locality>
</data_object>
Code example:
from lxml import etree

fname = "test.xml"
ELEMENT_LIST = ["source", "event", "state", "locality"]

# Open in binary mode so the parser can honor the declared encoding.
with open(fname, "rb") as xml_doc:
    context = etree.iterparse(xml_doc, events=("start", "end"))
    context = iter(context)
    # The first event is the start of the root element.
    event, root = next(context)
    base = False
    bname = ""
    for event, elem in context:
        if event == "start" and elem.tag in ELEMENT_LIST:
            base = True
            bname = elem.tag
            # Note: at a "start" event the children may not be parsed yet.
            child_list = []
            for child in elem:
                child_list.append(child.tag)
            print(bname + ":" + str(child_list))
        elif event == "end" and elem.tag in ELEMENT_LIST:
            base = False
            # Drop everything parsed so far to keep memory usage flat.
            root.clear()
Answer 0: (score: 1)
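With iterparse you cannot restrict parsing to several kinds of tags; you can only do that for a single tag (by passing the tag argument). However, what you want to achieve is easy to do by hand. In the following snippet: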
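from lxml import etree

fname = "test.xml"
ELEMENT_LIST = ["source", "event", "state", "locality"]

# Open in binary mode so the parser can honor the declared encoding.
with open(fname, "rb") as xml_doc:
    context = etree.iterparse(xml_doc, events=("start", "end"))
    for event, elem in context:
        # Children are only guaranteed to be complete at the "end" event.
        if event == "end" and elem.tag in ELEMENT_LIST:
            print("this elem is interesting, do some processing: %s: [%s]"
                  % (elem.tag, ", ".join(child.tag for child in elem)))
            # Free the memory held by the element once it has been processed.
            elem.clear()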
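you simply restrict your processing to the interesting tags. The important part of using iterparse is elem.clear(), which frees the memory of an element once it is no longer needed. That is what makes it memory efficient; see http://lxml.de/parsing.html#modifying-the-tree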
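To make the "filter by hand" idea concrete, here is a minimal sketch that collects the child data of each interesting element into a dictionary. It is only an illustration under some assumptions: it uses the standard library's xml.etree.ElementTree.iterparse instead of lxml so that it runs without extra packages (the event loop looks the same with lxml), and the names SAMPLE, ELEMENT_TAGS and extract_records are made up for this example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical inline sample, a shortened version of the question's document.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<data_object>
  <source id="0"><name>Office Issues</name><data_id>7</data_id></source>
  <state id="7"><name>Washington</name></state>
  <locality id="2"><name>Olympia</name><state_id>7</state_id><type>City</type></locality>
</data_object>"""

ELEMENT_TAGS = {"source", "event", "state", "locality"}


def extract_records(stream, wanted=ELEMENT_TAGS):
    """Yield (tag, id, {child tag: child text}) for each wanted element."""
    # Only "end" events are requested, so children are complete when we look.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag in wanted:
            children = {child.tag: child.text for child in elem}
            yield elem.tag, elem.get("id"), children
            # Release the element's children and text once it has been handled.
            elem.clear()


records = list(extract_records(io.BytesIO(SAMPLE)))
for tag, elem_id, children in records:
    print(tag, elem_id, children)
```

Clearing only the elements that were just processed keeps their children intact until the parent's "end" event arrives, which is why the child text is still readable at that point.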
Answer 1: (score: 0)
I would use XPath instead. It is much more elegant than walking the document yourself, and I assume it is also more efficient.
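As a rough illustration of what the XPath approach could look like, the sketch below selects localities by child value and by attribute. It is written under assumptions: it uses the limited XPath subset of the standard library's xml.etree.ElementTree so that it runs without extra packages (lxml's xpath() method supports full XPath 1.0), and the inline SAMPLE is a shortened, hypothetical version of the question's document.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline sample, a shortened version of the question's document.
SAMPLE = """<data_object>
  <state id="7"><name>Washington</name></state>
  <locality id="2"><name>Olympia</name><state_id>7</state_id><type>City</type></locality>
  <locality id="3"><name>Town</name><state_id>7</state_id><type>Town</type></locality>
</data_object>"""

root = ET.fromstring(SAMPLE)

# Names of all localities whose <state_id> child has the text "7".
locality_names = [
    loc.findtext("name")
    for loc in root.findall("./locality[state_id='7']")
]

# The <type> of the single locality selected by its id attribute.
city_type = root.find("./locality[@id='2']/type").text

print(locality_names, city_type)
```

Note that this parses the whole document into memory first, so unlike iterparse it trades memory for convenience.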
Answer 2: (score: 0)
Use tag='{http://www.sitemaps.org/schemas/sitemap/0.9}url'
A similar question with a correct answer: https://stackoverflow.com/a/7019273/1346222
#!/usr/bin/python
# coding: utf-8
""" Parsing xml file. Basic example """
from io import BytesIO
from urllib.request import urlopen

from lxml import etree

# Download the sitemap as raw bytes.
sitemap = urlopen(
    'http://google.com/sitemap.xml',
    timeout=10
).read()

NS = {
    'x': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'x2': 'http://www.google.com/schemas/sitemap-mobile/1.0'
}

res = []
# Only emit events for <url> elements in the sitemap namespace.
urls = etree.iterparse(BytesIO(sitemap), tag='{http://www.sitemaps.org/schemas/sitemap/0.9}url')
for event, url in urls:
    # Collect the location and priority text of each <url> entry.
    t = url.xpath('.//x:loc/text() | .//x:priority/text()', namespaces=NS)
    # Record whether the entry carries a mobile marker.
    t.append(url.xpath('boolean(.//x2:mobile)', namespaces=NS))
    res.append(t)
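The same namespace-qualified matching can be tried offline. The sketch below is only an approximation under assumptions: it uses the standard library's xml.etree.ElementTree.iterparse, which has no tag argument, so the "{namespace}url" comparison is done by hand, and the two-entry sitemap in SAMPLE is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Namespaced tags are reported as "{namespace-uri}localname".
URL_TAG = "{%s}url" % SITEMAP_NS

# Hypothetical two-entry sitemap, inlined so no network access is needed.
SAMPLE = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc><priority>1.0</priority></url>
  <url><loc>http://example.com/about</loc><priority>0.5</priority></url>
</urlset>"""

rows = []
for _event, elem in ET.iterparse(io.BytesIO(SAMPLE), events=("end",)):
    if elem.tag == URL_TAG:
        # The prefix "x" is local to this lookup; only the URI has to match.
        loc = elem.findtext("x:loc", namespaces={"x": SITEMAP_NS})
        priority = elem.findtext("x:priority", namespaces={"x": SITEMAP_NS})
        rows.append((loc, priority))
        # Free the processed entry before moving on.
        elem.clear()

print(rows)
```

The point to take away is that the tag filter has to include the namespace URI in braces; a bare "url" would not match any element in a namespaced sitemap.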