这是我的用例: 我有一个可能很大的XML文件,我想输出给定元素类型的所有独特结构变体的频率。元素属性应作为唯一性测试的一部分包含在内。输出应按频率对变化进行排序。
这是一个简单的输入示例,有4个汽车条目:
<automobile>
<mileage>20192</mileage>
<year>2005</year>
<user_defined name="color">red</user_defined>
</automobile>
<automobile>
<mileage>1098</mileage>
<year>2018</year>
<user_defined name="color">blue</user_defined>
</automobile>
<automobile>
<mileage>17964</mileage>
<year>2012</year>
<user_defined name="title_status">salvage</user_defined>
</automobile>
<automobile>
<mileage>198026</mileage>
<year>1990</year>
</automobile>
我期望的输出看起来像这样:
<automobile automobile_frequency="2">
<mileage />
<year />
<user_defined name="color" />
</automobile>
<automobile automobile_frequency="1">
<mileage />
<year />
<user_defined name="title_status" />
</automobile>
<automobile automobile_frequency="1">
<mileage />
<year />
</automobile>
我使用iterparse实现了代码,但是当它处理元素时,元素中不存在属性。代码逻辑似乎是正确的,但属性根本不存在;它们不是在输出中写入的,并且它们不存在于唯一性测试中。根据上面的输入示例,这是我得到的输出:
<root>
<automobile automobile_frequency="3">
<mileage/>
<year/>
<user_defined/>
</automobile>
<automobile automobile_frequency="1">
<mileage/>
<year/>
</automobile>
</root>
用法是:
xplore.py input.xml node_to_explore
在上面的例子中,我使用了:
xplore.py trivial.xml automobile
以下是来源:
from lxml import etree
import sys
import re
from datetime import datetime
# global node signature map
structure_map = {}
# global code frequency map
frequency_map = {}
# output tree
tmp_root = etree.Element("tmp_root")
def process_element(el):
global target
if el.tag != target:
return
# get the structure of the element
structure = get_structure(el)
global structure_map
structure_key = etree.tostring(structure, pretty_print=True)
if structure_key not in structure_map.keys():
# add signature to structure map
structure_map[structure_key] = structure
# add node to output
global tmp_root
tmp_root.append(structure)
# add signature to frequency map
frequency_map[structure_key] = 1
else:
# increment frequency map
frequency_map[structure_key] += 1
# returns a unique string representing the structure of the node
# including attributes
def get_structure(el):
# create new element for the return value
ret = etree.Element(el.tag)
# get attributes
attribute_keys = el.attrib.keys()
for attribute_key in attribute_keys:
ret.set(attribute_key, el.get(attribute_key))
# check for children
children = list(el)
for child in children:
ret.append(get_structure(child))
return ret
if len(sys.argv) < 3:
print "Must specify an XML file for processing, as well as an element type!"
exit(0)
# Get XML file
xml = sys.argv[1]
# Get output file name
output_file = xml[0:xml.rindex(".")]+".txt"
# get target element type to evaluate
target = sys.argv[2]
# mark start
startTime = datetime.now()
# Parse XML
print '==========================='
print 'Parsing XML'
print '==========================='
context = etree.iterparse(xml, events=('end',))
for event, element in context:
process_element(element)
element.clear()
# create tree sorted by frequency
ranked = sorted(frequency_map.items(), key=lambda x: x[1], reverse=True)
root = etree.Element("root")
for item in ranked:
structure = structure_map[item[0]]
structure.set(target+"_frequency", str(item[1]))
root.append(structure)
# pretty print root
out = open(output_file, 'w')
out.write(etree.tostring(root, pretty_print=True))
# output run time
time = datetime.now() - startTime
reg3 = re.compile("\\d+:\\d(\\d:\\d+\\.\\d{4})")
time = re.search(reg3, unicode(time))
time = "Runtime: %ss" % (time.group(1).encode("utf-8"))
print(time)
在调试器中,我可以清楚地看到get_structure调用中的元素缺少属性。谁能告诉我为什么会这样呢?