Element attributes missing when parsing XML with iterparse / lxml / Python 2

Date: 2018-03-12 22:35:07

Tags: python xml lxml iterparse

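Here is my use case: I have a potentially large XML file, and I want to output the frequency of every unique structural variant of a given element type. Element attributes should be included in the uniqueness test. The output should list the variants sorted by frequency.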

Here is a trivial input example with four automobile entries:

<automobile>
    <mileage>20192</mileage>
    <year>2005</year>
    <user_defined name="color">red</user_defined>
</automobile>
<automobile>
    <mileage>1098</mileage>
    <year>2018</year>
    <user_defined name="color">blue</user_defined>
</automobile>
<automobile>
    <mileage>17964</mileage>
    <year>2012</year>
    <user_defined name="title_status">salvage</user_defined>
</automobile>
<automobile>
    <mileage>198026</mileage>
    <year>1990</year>
</automobile>

The output I expect looks like this:

<automobile automobile_frequency="2">
    <mileage />
    <year />
    <user_defined name="color" />
</automobile>
<automobile automobile_frequency="1">
    <mileage />
    <year />
    <user_defined name="title_status" />
</automobile>
<automobile automobile_frequency="1">
    <mileage />
    <year />
</automobile>
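
The uniqueness test described above can be sketched independently of streaming. This is a minimal illustration, not the script from this question: it uses the standard library's xml.etree.ElementTree and loads the whole document into memory, so it does not address the large-file requirement. The names signature and count_variants, and the wrapper element <cars>, are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def signature(el):
    """Hashable structure of an element: tag, sorted attributes, child signatures.

    Text content is deliberately ignored, so only structure and attributes count.
    """
    return (el.tag,
            tuple(sorted(el.attrib.items())),
            tuple(signature(child) for child in el))


def count_variants(root, target):
    """Return (signature, frequency) pairs for `target` elements, most frequent first."""
    return Counter(signature(el) for el in root.iter(target)).most_common()


# Hypothetical wrapper element <cars> so the sample is a single well-formed document.
doc = ET.fromstring(
    '<cars>'
    '<automobile><mileage>20192</mileage><year>2005</year>'
    '<user_defined name="color">red</user_defined></automobile>'
    '<automobile><mileage>1098</mileage><year>2018</year>'
    '<user_defined name="color">blue</user_defined></automobile>'
    '<automobile><mileage>17964</mileage><year>2012</year>'
    '<user_defined name="title_status">salvage</user_defined></automobile>'
    '<automobile><mileage>198026</mileage><year>1990</year></automobile>'
    '</cars>')
variants = count_variants(doc, 'automobile')
```

Because attribute names and values are part of each signature, the two name="color" entries collapse into one variant while name="title_status" stays separate.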

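I implemented the code using iterparse, but when it processes the elements, their attributes are missing. The logic looks correct, yet the attributes simply are not there: they are not written to the output, and they are not taken into account by the uniqueness test. For the input example above, this is the output I get: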

<root>
  <automobile automobile_frequency="3">
    <mileage/>
    <year/>
    <user_defined/>
  </automobile>
  <automobile automobile_frequency="1">
    <mileage/>
    <year/>
  </automobile>
</root>

Usage:

xplore.py input.xml node_to_explore 

For the example above, I used:

xplore.py trivial.xml automobile

Here is the source:

from lxml import etree
import sys
import re
from datetime import datetime


# global node signature map
structure_map = {}
# global node frequency map
frequency_map = {}
# output tree
tmp_root = etree.Element("tmp_root")


def process_element(el):
    global target
    if el.tag != target:
        return
    # get the structure of the element
    structure = get_structure(el)
    global structure_map
    structure_key = etree.tostring(structure, pretty_print=True)
    if structure_key not in structure_map:
        # add signature to structure map
        structure_map[structure_key] = structure
        # add node to output
        global tmp_root
        tmp_root.append(structure)
        # add signature to frequency map
        frequency_map[structure_key] = 1
    else:
        # increment frequency map
        frequency_map[structure_key] += 1


# returns a new element that mirrors the structure of the node
# (tags, attributes, and children, but no text content)
def get_structure(el):
    # create new element for the return value
    ret = etree.Element(el.tag)
    # get attributes
    attribute_keys = el.attrib.keys()
    for attribute_key in attribute_keys:
        ret.set(attribute_key, el.get(attribute_key))
    # check for children
    children = list(el)
    for child in children:
        ret.append(get_structure(child))
    return ret


if len(sys.argv) < 3:
    print "Must specify an XML file for processing, as well as an element type!"
    # exit with a non-zero status to signal the usage error
    sys.exit(1)

# Get XML file
xml = sys.argv[1]
# Get output file name
output_file = xml[0:xml.rindex(".")]+".txt"
# get target element type to evaluate
target = sys.argv[2]
# mark start
startTime = datetime.now()
# Parse XML

print '==========================='
print 'Parsing XML'
print '==========================='
context = etree.iterparse(xml, events=('end',))
for event, element in context:
    process_element(element)
    element.clear()
# create tree sorted by frequency
ranked = sorted(frequency_map.items(), key=lambda x: x[1], reverse=True)
root = etree.Element("root")
for item in ranked:
    structure = structure_map[item[0]]
    structure.set(target+"_frequency", str(item[1]))
    root.append(structure)
# pretty print root
with open(output_file, 'w') as out:
    out.write(etree.tostring(root, pretty_print=True))
# output run time
time = datetime.now() - startTime
reg3 = re.compile("\\d+:\\d(\\d:\\d+\\.\\d{4})")
time = re.search(reg3, unicode(time))
time = "Runtime: %ss" % (time.group(1).encode("utf-8"))
print(time)

In the debugger I can clearly see that the elements passed to get_structure have no attributes. Can anyone tell me why this happens?
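
One likely cause, offered as a hedged sketch rather than a confirmed diagnosis: with events=('end',), the 'end' event of each child fires before that of its parent, and the loop calls element.clear() on every element. By the time the automobile element reaches process_element, its children have therefore already had their attributes and text wiped. The sketch below reproduces this with the standard library's xml.etree.ElementTree (an assumption made so that it runs without lxml; iterparse and clear() behave the same way for this purpose) and compares it with a variant that clears only the target element. The function collect_child_attribs and the flag clear_every_element are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

XML = (b'<cars>'
       b'<automobile><year>2005</year>'
       b'<user_defined name="color">red</user_defined></automobile>'
       b'</cars>')


def collect_child_attribs(xml_bytes, target, clear_every_element):
    """Return the attribute dicts of each target's children,
    as they look when the target's own 'end' event arrives."""
    seen = []
    for event, el in ET.iterparse(io.BytesIO(xml_bytes), events=('end',)):
        if el.tag == target:
            seen.append([dict(child.attrib) for child in el])
            el.clear()  # safe here: the target has been fully processed
        elif clear_every_element:
            # mirrors the loop in the question: children are wiped
            # before their parent is ever inspected
            el.clear()
    return seen


buggy = collect_child_attribs(XML, 'automobile', clear_every_element=True)
fixed = collect_child_attribs(XML, 'automobile', clear_every_element=False)
```

In the script above, the equivalent change would be to call element.clear() only when element.tag == target. lxml's iterparse also accepts a tag keyword argument, for example etree.iterparse(xml, events=('end',), tag=target), which restricts the events to the target element.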

0 answers