Question

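I am trying to remove nodes from an XML file. I have managed to get this far, but when the script runs, it also seems to take away what looks like attributes belonging to the parent element.

Here is the code: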

for i, pid in enumerate(root.findall(".//p")):
    for cont in pid.findall('membercontribution'):
        for col in cont.findall('col'):
            cont.remove(col)


tree.write('fofo.xml')

This:

<p id="S6CV0001P0-00507"><member>The Minister for Overseas Development (Mr. Neil Marten)        
</member><membercontribution>: a policy
<col>16</col>
foobar barforb </membercontribution></p>

becomes this:

<p id="S6CV0001P0-00507"><member>The Minister for Overseas Development (Mr. Neil Marten)    
</member><membercontribution>: a policy </membercontribution></p>

How can I code this so that the trailing "foobar barforb" part is kept?

Answer 1

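What is being removed unintentionally here is not an attribute but the content of the element's tail.

The tail attribute is a peculiarity of the ElementTree API. It holds the text that immediately follows an element's end tag and precedes any other tag. When you remove an element (col in this case), its tail is removed along with it.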
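
To make the text/tail distinction concrete, here is a minimal sketch. The tiny document and its element names (a, b) are made up for illustration and are not taken from the question's file.

```python
from xml.etree import ElementTree as ET

# Hypothetical minimal document, used only to show where text and tail live.
elem = ET.fromstring("<a>before<b>inside</b>after</a>")
b = elem.find("b")

print(repr(elem.text))  # text of <a>, up to its first child: 'before'
print(repr(b.text))     # text inside <b>: 'inside'
print(repr(b.tail))     # text after </b>, stored on <b> itself: 'after'

# Removing <b> takes its tail along, because the tail belongs to <b>.
elem.remove(b)
print(ET.tostring(elem, encoding="unicode"))  # <a>before</a>
```

The word "after" disappears from the serialized result even though it sits inside a in the markup, which is the same effect the question describes for "foobar barforb".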

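The clearest explanation I have found is: http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/etree-view.html.

To get the desired output, you need to keep a reference to the tail of the removed col element and append it to the parent element's text. A complete example: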

from xml.etree import ElementTree as ET

XML = """
<root>
<p id="S6CV0001P0-00507"><member>The Minister for Overseas Development (Mr. Neil Marten)
</member><membercontribution>: a policy
<col>16</col>
foobar barforb </membercontribution></p>
</root>
"""

root = ET.fromstring(XML)

for pid in root.findall(".//p"):
    for cont in pid.findall('membercontribution'):
        for col in cont.findall('col'):
            col_tail = (col.tail or "").strip()          # Get the tail of "col" (tail is None when there is no trailing text)
            cont.remove(col)                             # Remove "col"
            cont.text = (cont.text or "").strip() + " "  # Replace trailing whitespace with single space
            cont.text = cont.text + col_tail             # Add the tail to "membercontribution"

print(ET.tostring(root, encoding="unicode"))

Output:

<root>
<p id="S6CV0001P0-00507"><member>The Minister for Overseas Development (Mr. Neil Marten)
</member><membercontribution>: a policy foobar barforb</membercontribution></p>
</root>
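
The example above assumes col is the first child of membercontribution, so its tail can simply be appended to the parent's text. As a sketch for the more general case, the tail can instead be attached to the previous sibling's tail when there is one. The helper name remove_keep_tail and the sample document are hypothetical, and unlike the code above this version keeps whitespace exactly as it was rather than normalizing it.

```python
from xml.etree import ElementTree as ET

def remove_keep_tail(parent, child):
    """Remove child from parent, keeping the text stored in child.tail."""
    tail = child.tail or ""            # tail is None when nothing follows the child
    children = list(parent)
    index = children.index(child)
    if index == 0:
        # No earlier sibling: the tail continues the parent's own text.
        parent.text = (parent.text or "") + tail
    else:
        # Otherwise the tail continues the previous sibling's tail.
        previous = children[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)

# Hypothetical sample with a col both before and after another element.
doc = ET.fromstring("<m>a <col>1</col>b <i>x</i> c <col>2</col>d</m>")
for col in doc.findall("col"):         # findall returns a list, so removing is safe
    remove_keep_tail(doc, col)

print(ET.tostring(doc, encoding="unicode"))  # <m>a b <i>x</i> c d</m>
```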
