How do I remove child nodes from XML in Python?

Time: 2015-07-22 21:55:07

Tags: xml python-2.7 elementtree removechild

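I have a list of nodes that I want to remove from an XML document, but I am having trouble removing the elements and writing the modified document to a new XML file.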

Here is the Python program I wrote (I am using ElementTree):

from xml.etree.ElementTree import ElementTree

tree = ElementTree()
tree.parse('autogen_test.xml')
root = tree.getroot()
keeper_data = ['4294905264']
instances = tree.findall('./DIMENSION/DIMENSION_NODE/DIMENSION_NODE')
removeList = list()
for instance in instances:
    #print instance
    data1 = instance.find('./DVAL/DVAL_ID')
    if data1.attrib.get("ID") not in keeper_data:
        removeList.append(instance)
for tag in removeList:
    parent = tree.findall('./DIMENSION/DIMENSION_NODE/DIMENSION_NODE')
    parent.remove(tag)
tree.write("out.xml")

My sample XML is below (this is a standard input that I cannot modify):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DIMENSIONS SYSTEM "dimensions.dtd">
<DIMENSIONS>
   <NUM_DVALS>88816</NUM_DVALS>
   <DIMENSION NAME="Brand" SRC_FILE="" SRC_TYPE="INTERNAL">
      <DIMENSION_ID ID="4294905334"/>
      <DIMENSION_NODE>
         <DVAL TYPE="EXACT">
            <DVAL_ID ID="2"/>
            <SYN DISPLAY="TRUE" SEARCH="FALSE" CLASSIFY="FALSE">Brand</SYN>
         </DVAL>
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905325"/>
               <SYN DISPLAY="TRUE" SEARCH="TRUE" CLASSIFY="TRUE">hanes</SYN>
            </DVAL>
         </DIMENSION_NODE>
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905315"/>
               <SYN DISPLAY="TRUE" SEARCH="TRUE" CLASSIFY="TRUE">lee</SYN>
            </DVAL>
         </DIMENSION_NODE>
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905281"/>
               <SYN DISPLAY="TRUE" SEARCH="TRUE" CLASSIFY="TRUE">levi's</SYN>
            </DVAL>
         </DIMENSION_NODE>
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905264"/>
               <SYN DISPLAY="TRUE" SEARCH="TRUE" CLASSIFY="TRUE">braun</SYN>
            </DVAL>
         </DIMENSION_NODE>
        </DIMENSION_NODE>
   </DIMENSION>
   </DIMENSIONS>

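Even after iterating over the list and finding all the nodes to remove, tree.write("out.xml") always writes out the original XML. Basically, I need to remove the identified nodes from the original XML.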

Expected output:

<DIMENSIONS>
   <NUM_DVALS>88816</NUM_DVALS>
   <DIMENSION NAME="Brand" SRC_FILE="" SRC_TYPE="INTERNAL">
      <DIMENSION_ID ID="4294905334" />
      <DIMENSION_NODE>
         <DVAL TYPE="EXACT">
            <DVAL_ID ID="2" />
            <SYN CLASSIFY="FALSE" DISPLAY="TRUE" SEARCH="FALSE">Brand</SYN>
         </DVAL>
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905264" />
               <SYN CLASSIFY="TRUE" DISPLAY="TRUE" SEARCH="TRUE">braun</SYN>
            </DVAL>
         </DIMENSION_NODE>
        </DIMENSION_NODE>
   </DIMENSION>
</DIMENSIONS>

1 Answer:

Answer 0 (score: 1)

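All the DIMENSION_NODE elements to be removed share the same parent DIMENSION_NODE, so it is more efficient to fetch that parent once, before looping over removeList. More importantly, you need the parent DIMENSION_NODE rather than the child DIMENSION_NODE elements, so the correct XPath is ./DIMENSION/DIMENSION_NODE. Note also that findall returns a plain Python list, so calling remove on its result only changes that list and leaves the tree untouched; find returns the parent Element itself, whose remove method detaches the child from the tree. In short, try replacing the second for loop with the following code: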

parent = tree.find('./DIMENSION/DIMENSION_NODE')
for tag in removeList:
    parent.remove(tag)  
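
To see why the original loop left the tree unchanged, here is a minimal sketch on a tiny made-up document (the element names root, parent and child are hypothetical and not taken from the question's XML). It compares removing an element from the list returned by findall with removing it through its parent Element:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, used only for illustration.
root = ET.fromstring("<root><parent><child id='a'/><child id='b'/></parent></root>")

matches = root.findall("./parent/child")  # a plain Python list of Elements
first = matches[0]

# Removing from the list only shrinks the list; the tree still has both children.
matches.remove(first)
children_after_list_remove = len(root.find("./parent"))

# Removing through the parent Element actually detaches the child from the tree.
parent = root.find("./parent")
parent.remove(first)
children_after_element_remove = len(parent)

print(children_after_list_remove, children_after_element_remove)
```

The first count still includes both children, while the second reflects the removal, which matches the behaviour described in the question.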

Here is a complete working example for demonstration (just replace the value of source with the actual XML):

import xml.etree.ElementTree as ET

source = """replace with the XML in question"""

root = ET.fromstring(source)
keeper_data = ['4294905264']
instances = root.findall('.//DIMENSION/DIMENSION_NODE/DIMENSION_NODE')
removeList = list()
for instance in instances:
    data1 = instance.find('./DVAL/DVAL_ID')
    if data1.attrib.get("ID") not in keeper_data:
        removeList.append(instance)
parent = root.find('.//DIMENSION/DIMENSION_NODE')
for tag in removeList:
    parent.remove(tag)

print(ET.tostring(root))

With the XML from the question as the value of the source variable, the output is:

<DIMENSIONS>
   <NUM_DVALS>88816</NUM_DVALS>
   <DIMENSION NAME="Brand" SRC_FILE="" SRC_TYPE="INTERNAL">
      <DIMENSION_ID ID="4294905334" />
      <DIMENSION_NODE>
         <DVAL TYPE="EXACT">
            <DVAL_ID ID="2" />
            <SYN CLASSIFY="FALSE" DISPLAY="TRUE" SEARCH="FALSE">Brand</SYN>
         </DVAL>
         <DIMENSION_NODE>
            <DVAL TYPE="EXACT">
               <DVAL_ID ID="4294905264" />
               <SYN CLASSIFY="TRUE" DISPLAY="TRUE" SEARCH="TRUE">braun</SYN>
            </DVAL>
         </DIMENSION_NODE>
        </DIMENSION_NODE>
   </DIMENSION>
</DIMENSIONS>
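
If the input might contain more than one DIMENSION, each with its own parent DIMENSION_NODE, the same idea can be wrapped in a small helper. The following is only a sketch under assumptions: the function name prune_dimension_nodes is hypothetical, the shortened sample XML is a cut-down version of the one in the question, and nodes without a DVAL_ID are kept rather than removed.

```python
import xml.etree.ElementTree as ET


def prune_dimension_nodes(root, keep_ids):
    """Remove nested DIMENSION_NODE elements whose DVAL_ID is not in keep_ids.

    Hypothetical helper: returns the number of removed elements.
    """
    removed = 0
    # Every top-level DIMENSION_NODE under a DIMENSION is treated as a parent.
    for parent in root.findall("./DIMENSION/DIMENSION_NODE"):
        # findall returns a separate list, so removing children while
        # looping over it does not skip any element.
        for child in parent.findall("./DIMENSION_NODE"):
            dval_id = child.find("./DVAL/DVAL_ID")
            if dval_id is None:
                continue  # assumption: keep nodes that carry no DVAL_ID
            if dval_id.get("ID") not in keep_ids:
                parent.remove(child)
                removed += 1
    return removed


# Cut-down version of the question's XML, for illustration only.
sample = """<DIMENSIONS>
  <DIMENSION NAME="Brand">
    <DIMENSION_NODE>
      <DVAL><DVAL_ID ID="2"/></DVAL>
      <DIMENSION_NODE><DVAL><DVAL_ID ID="4294905315"/></DVAL></DIMENSION_NODE>
      <DIMENSION_NODE><DVAL><DVAL_ID ID="4294905264"/></DVAL></DIMENSION_NODE>
    </DIMENSION_NODE>
  </DIMENSION>
</DIMENSIONS>"""

root = ET.fromstring(sample)
removed_count = prune_dimension_nodes(root, {"4294905264"})
remaining_ids = [e.get("ID") for e in root.iter("DVAL_ID")]
print(removed_count, remaining_ids)
```

To save the result to a file as in the question, the modified root can be wrapped and written with ET.ElementTree(root).write("out.xml").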