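I am trying to come up with a simple solution that iterates over XML data and removes the next element if it is a duplicate of the current one.
Example:
From this "input":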
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data2</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data5</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data6</c>
</b>
</root>
I would like to get this "output":
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
</root>
For this, I came up with the following code:
from lxml import etree
from io import StringIO
xml = '''
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data2</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data5</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data6</c>
</b>
</root>'''
# this is to simulate that the above xml was read from a file
# (Python 3: the string is already unicode, so no unicode() call is needed)
file = StringIO(xml)
# reading the xml from a file
tree = etree.parse(file)
root = tree.getroot()
# iterate over all "b" elements
for element in root.iter('b'):
    # checks if the last "b" element has been reached.
    # on the last element it raises an "AttributeError" exception and terminates the for loop
    try:
        # attributes of actual element
        elem_attrib_ACT = element.attrib
        # attributes of next element
        elem_attrib_NEXT = element.getnext().attrib
    except AttributeError:
        # if no other element, break
        break
    print('attributes of ACTUAL elem:', elem_attrib_ACT, 'attributes of NEXT elem:', elem_attrib_NEXT)
    if elem_attrib_ACT == elem_attrib_NEXT:
        print('next elem is duplicate of actual one -> remove it')
        # I would like to remove next element but this approach is not working
        # if you uncomment, it removes the elements of "data2" but stops
        # how to remove the next duplicate element?
        #element.getparent().remove(element.getnext())
    else:
        print('next elem is not a duplicate of actual')
print('result:')
print(etree.tostring(root))
Uncommenting the line
#element.getparent().remove(element.getnext())
removes the element around "data2", but then the loop stops. The resulting XML is this:
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data5</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data6</c>
</b>
</root>
My impression is that I am "sawing off the branch I am sitting on"...
How can I solve this?
Answer 0: (score: 2)
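I think your suspicion is correct. If you put a print statement just before the except block, you can see that the loop breaks early because the element it has moved on to is the one that was just removed (I think). A removed element no longer has a next sibling, so getnext() returns None and the AttributeError triggers the break: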
<b attrib1="abc" attrib2="def">
<c>data2</c>
</b>
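That suspicion can be checked with a small experiment. This is only a sketch, assuming lxml is installed; the shortened document and the names visited and remaining are made up for illustration and are not part of the question's code. It records which elements the loop actually sees while the next sibling is being removed.

```python
from lxml import etree

# Shortened, hypothetical version of the document from the question
root = etree.fromstring(
    '<root>'
    '<b attrib1="abc"><c>data1</c></b>'
    '<b attrib1="abc"><c>data2</c></b>'
    '<b attrib1="uvw"><c>data3</c></b>'
    '</root>'
)

visited = []  # text of every <b> element the loop body runs for
for element in root.iter('b'):
    visited.append(element.findtext('c'))
    nxt = element.getnext()
    if nxt is None:
        # no next sibling: either the last element or a detached one
        break
    if element.attrib == nxt.attrib:
        # remove the duplicate that follows the current element
        element.getparent().remove(nxt)

# elements still attached to the tree after the loop
remaining = [b.findtext('c') for b in root.iter('b')]
print(visited)
print(remaining)
```

If the loop body runs for "data2" even though that element is no longer in the tree, the iterator had already picked it as its next stop before it was removed, which would explain the early break.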
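Try using getprevious() instead of getnext(), and remove the current element rather than the next one. I also switched to iterating over a list built with a list comprehension, which takes a snapshot of the elements and lets me skip the first one (for which .getprevious() returns None, so accessing .attrib would of course raise an exception):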
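# build a list first so that removing elements does not disturb the iteration,
# and skip the first element because it has no previous sibling
for element in [e for e in root.iter('b')][1:]:
    try:
        # compare with the previous sibling that is still in the tree
        if element.getprevious().attrib == element.attrib:
            element.getparent().remove(element)
    except AttributeError:
        # getprevious() returned None
        print('except')
print(etree.tostring(root))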
Result:
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
</root>
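For comparison, here is a minimal sketch of the same idea that remembers the attributes of the last kept element instead of calling getprevious(). It deliberately uses the standard-library xml.etree.ElementTree rather than lxml, and the helper name drop_consecutive_duplicates is hypothetical. It assumes all compared elements are direct children of one parent, as in the example.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<b attrib1="abc" attrib2="def"><c>data1</c></b>
<b attrib1="abc" attrib2="def"><c>data2</c></b>
<b attrib1="uvw" attrib2="xyz"><c>data3</c></b>
<b attrib1="abc" attrib2="def"><c>data4</c></b>
<b attrib1="abc" attrib2="def"><c>data5</c></b>
<b attrib1="abc" attrib2="def"><c>data6</c></b>
</root>"""


def drop_consecutive_duplicates(parent, tag):
    """Remove each `tag` child whose attributes equal those of the last kept one."""
    previous_attrib = None
    # findall() returns a list, so removing children does not disturb the loop
    for child in parent.findall(tag):
        if child.attrib == previous_attrib:
            parent.remove(child)
        else:
            # keep this element and remember its attributes for the next comparison
            previous_attrib = dict(child.attrib)


root = ET.fromstring(XML)
drop_consecutive_duplicates(root, 'b')
kept = [c.text for c in root.iter('c')]
print(kept)
```

Because the loop runs over a list snapshot and only ever removes the element it is currently looking at, it never depends on an element that has already been detached from the tree.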