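How to iterate over XML data with lxml to remove the next duplicate element

Asked: 2015-08-19 13:49:40

Tags: python xml lxml

I am struggling to come up with a simple solution that iterates over XML data and removes the next element if it is a duplicate of the current one.

Example:

From this "input":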

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data2</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>

I would like to get this "output":

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
</root>

To do this, I came up with the following code:

from lxml import etree
from io import StringIO


xml = '''
<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data2</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>'''

# this is to simulate that above xml was read from a file
file = StringIO(xml)  # Python 3; on Python 2 use StringIO(unicode(xml))

# reading the xml from a file
tree = etree.parse(file)
root = tree.getroot()

# iterate over all "b" elements
for element in root.iter('b'):
    # checks if the last "b" element has been reached.
    # on the last element getnext() returns None, so accessing .attrib raises AttributeError, which ends the for loop
    try:
        # attributes of actual element
        elem_attrib_ACT = element.attrib
        # attributes of next element
        elem_attrib_NEXT = element.getnext().attrib
    except AttributeError:
        # if no other element, break
        break
    print('attributes of ACTUAL elem:', elem_attrib_ACT, 'attributes of NEXT elem:', elem_attrib_NEXT)
    if elem_attrib_ACT == elem_attrib_NEXT:
        print('next elem is duplicate of actual one -> remove it')
        # I would like to remove next element but this approach is not working
        # if you uncomment, it removes the elements of "data2" but stops
        # how to remove the next duplicate element?
        #element.getparent().remove(element.getnext())
    else:
        print('next elem is not a duplicate of actual')

print('result:')
print(etree.tostring(root))

Uncommenting the line

#element.getparent().remove(element.getnext())

removes the element around "data2", but then the loop stops. The resulting XML is this:

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>

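My impression is that I am "sawing off the branch I am sitting on"...

How can I solve this?

1 Answer:

Answer 0 (score: 2)

I think your suspicion is correct. If you put a print statement before the except block, you can see that the loop breaks early because this element has already been removed (I think):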

<b attrib1="abc" attrib2="def">
    <c>data2</c>
</b>
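One way to check this suspicion, sketched below on a shortened version of the input, is to record whether each element handed out by the iterator still has a parent. This assumes lxml is installed. The shortened XML and the names `visited` and `detached` are made up for illustration, and the comments describe my reading of the observed behaviour, not documented lxml semantics.

```python
from lxml import etree

# Shortened, hypothetical input: two pairs of consecutive duplicates
root = etree.fromstring(
    '<root>'
    '<b attrib1="abc"><c>data1</c></b>'
    '<b attrib1="abc"><c>data2</c></b>'
    '<b attrib1="uvw"><c>data3</c></b>'
    '<b attrib1="uvw"><c>data4</c></b>'
    '</root>'
)

visited = []
for element in root.iter('b'):
    # An element that was removed from the tree no longer has a parent
    detached = element.getparent() is None
    visited.append((element[0].text, detached))
    nxt = element.getnext()
    if nxt is None:
        # Same exit condition as the AttributeError in the question
        break
    if element.attrib == nxt.attrib:
        # Removing the next sibling while root.iter() is still running
        element.getparent().remove(nxt)

print(visited)
print([c.text for c in root.iter('c')])
```

If the suspicion is right, the last entry in `visited` should be flagged as detached, and the second pair of duplicates should still be in the tree because the loop ended before reaching it.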

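Try using getprevious() instead of getnext(). I also switched to a list comprehension and skip the first element, to avoid an error there (the first element has no previous sibling, so accessing .attrib on the result of .getprevious() would of course raise an exception):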

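# Build the list first so that removing elements cannot disturb the iteration
for element in [e for e in root.iter('b')][1:]:
    previous = element.getprevious()
    # Remove the element if its attributes match those of the preceding sibling
    if previous is not None and previous.attrib == element.attrib:
        element.getparent().remove(element)
print(etree.tostring(root))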

Result:

<root>
<b attrib1="abc" attrib2="def">
    <c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
    <c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
    <c>data4</c>
</b>
</root>
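An alternative that needs neither getnext() nor getprevious() is to walk a snapshot of the parent's children and remember the last element that was kept. The following is only a sketch under a few assumptions: the helper name `remove_consecutive_duplicates` is made up for illustration, "duplicate" is taken to mean "same tag and same attributes as the last kept sibling", and the import falls back to the standard library's xml.etree.ElementTree when lxml is not installed (the sketch only uses calls that both provide).

```python
try:
    from lxml import etree
except ImportError:  # fallback: the stdlib API is enough for this sketch
    import xml.etree.ElementTree as etree


def remove_consecutive_duplicates(parent):
    """Remove each child whose tag and attributes equal the last kept child.

    Hypothetical helper for illustration; it is not part of lxml.
    """
    last_kept = None
    # list(parent) is a snapshot, so removing children is safe while looping
    for child in list(parent):
        if (last_kept is not None
                and child.tag == last_kept.tag
                and dict(child.attrib) == dict(last_kept.attrib)):
            parent.remove(child)
        else:
            last_kept = child


xml = '''<root>
    <b attrib1="abc" attrib2="def"><c>data1</c></b>
    <b attrib1="abc" attrib2="def"><c>data2</c></b>
    <b attrib1="uvw" attrib2="xyz"><c>data3</c></b>
    <b attrib1="abc" attrib2="def"><c>data4</c></b>
    <b attrib1="abc" attrib2="def"><c>data5</c></b>
    <b attrib1="abc" attrib2="def"><c>data6</c></b>
</root>'''

root = etree.fromstring(xml)
remove_consecutive_duplicates(root)
remaining = [c.text for c in root.iter('c')]
print(remaining)
```

Because the comparison is against the last element that was kept, not the element immediately before in the original document, a run of three or more duplicates collapses to its first member in a single pass.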