How to write an empty tree node as an empty string to an XML file

Time: 2016-06-23 22:37:44

Tags: python xml

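I want to remove elements with a certain tag value and then write out the .xml file without any tags for the removed elements. Is creating a new tree my only option?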

There are two options for removing elements:

  

clear()   Resets an element. This function removes all subelements, clears all attributes, and sets the text and tail attributes to None.

At first I used this, and it does remove the data from the element, but I am still left with an empty element:

# Remove all elements from the tree that are NOT "job" or "make" or "build" elements
log = open("debug.log", "w")
for el in root.iter("*"):
    if el.tag != "job" and el.tag != "make" and el.tag != "build":
        print("removed = ", el.tag, el.attrib, file=log)
        el.clear()
    else:
        print("NOT", el.tag, el.attrib, file=log)

log.close()
tree.write("make_and_job_tree.xml", short_empty_elements=False)

The problem is that xml.etree.ElementTree.ElementTree.write() still writes out empty tags no matter what:

  

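...The keyword-only short_empty_elements parameter controls the formatting of elements that contain no content. If True (the default), they are emitted as a single self-closed tag, otherwise they are emitted as a pair of start/end tags.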

Why is there no option to not print those empty tags at all! Never mind.

So then I thought I might try

  

remove(subelement)   Removes subelement from the element. Unlike the find* methods, this method compares elements based on instance identity, not on tag value or contents.

But this only works on direct subelements.
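One common way around the "direct children only" limitation is to build a child-to-parent lookup first, since ElementTree elements do not keep a reference to their parent. The following is a minimal sketch, assuming the standard-library xml.etree.ElementTree; `SAMPLE` and `parent_map` are illustrative names that do not appear in the original code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = """<data>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

root = ET.fromstring(SAMPLE)

# Elements carry no parent pointer, so build one lookup table up front.
parent_map = {child: parent for parent in root.iter() for child in parent}

# Collect the targets into a list first, then remove each one through its parent.
for el in list(root.iter("neighbor")):
    parent_map[el].remove(el)

remaining_tags = [el.tag for el in root.iter()]
print(remaining_tags)
```

Because the targets are collected into a list before anything is removed, the tree is never modified while it is being traversed.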

So I would have to do something like

for el in root.iter("*"):
    for subel in el:
        if subel.tag != "make" and subel.tag != "job" and subel.tag != "build":
            el.remove(subel)

But there is a big problem here: I invalidate the iterator by removing elements, right?

Is it enough to simply check whether the element is empty by adding if subel?:

if subel and subel.tag != "make" and subel.tag != "job" and subel.tag != "build"

Or do I have to get a new iterator every time I invalidate a tree element?
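The concern about mutating while iterating can be checked with a small experiment. The sketch below uses a made-up `<country>` element and a hypothetical helper, `tags_after_removal`, to compare removing children while looping over the element itself with looping over a copy of its child list. Note that `if subel` would not help here: the truth value of an element reflects whether it has children, not whether it is still attached to the tree.

```python
import xml.etree.ElementTree as ET


def tags_after_removal(copy_first):
    # Hypothetical parent with three adjacent neighbors, for illustration only.
    country = ET.fromstring(
        "<country><neighbor/><neighbor/><neighbor/><rank>1</rank></country>"
    )
    # list(country) snapshots the children; iterating country directly does not.
    children = list(country) if copy_first else country
    for child in children:
        if child.tag == "neighbor":
            country.remove(child)
    return [child.tag for child in country]


unsafe = tags_after_removal(copy_first=False)
safe = tags_after_removal(copy_first=True)
print("iterating the live element:", unsafe)
print("iterating a copy:          ", safe)
```

Iterating the live element skips a sibling after each removal, so at least one neighbor survives. Iterating over the snapshot removes all of them, and no fresh iterator is needed after each removal.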

Keep in mind: I just want to write out the xml file without the empty element tags.

Here is an example.

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Suppose I want to remove every mention of neighbor. Ideally, this is the output I want after the removal:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
    </country>
</data>

The problem is that when I run my code using clear() (see the first code block above) and write the result to a file, I get this:

<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor></neighbor><neighbor></neighbor></country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor></neighbor></country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor></neighbor><neighbor></neighbor></country>
</data>

Note that neighbor still appears.

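I know I could easily run a regex over the output, but there must be a way (or another Python API) to do this on the fly, rather than requiring me to touch my .xml file a second time.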

3 answers:

Answer 0 (score: 3)

import lxml.etree as et

xml  = et.parse("test.xml")

for node in xml.xpath("//neighbor"):
    node.getparent().remove(node)


xml.write("out.xml",encoding="utf-8",xml_declaration=True)

Using ElementTree, we need to find the parents of the neighbor nodes, then find the neighbor nodes inside each parent and remove them:

from xml.etree import ElementTree as et

xml  = et.parse("test.xml")


for parent in xml.getroot().findall(".//neighbor/.."):
      for child in parent.findall("./neighbor"):
          parent.remove(child)


xml.write("out.xml",encoding="utf-8",xml_declaration=True)

Both will give you:

<?xml version='1.0' encoding='utf-8'?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        </country>
</data>
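The closing `</country>` tags in the output above are over-indented because the last remaining child keeps its original tail whitespace after the neighbors are removed. One possible cleanup, assuming Python 3.9 or later (where `xml.etree.ElementTree.indent` is available), is to re-indent the tree before writing it. The `SAMPLE` string is a shortened stand-in for test.xml, not the original file.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for test.xml.
SAMPLE = """<data>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

root = ET.fromstring(SAMPLE)

# Same removal logic as above: find each parent, then remove its neighbor children.
for parent in root.findall(".//neighbor/.."):
    for child in parent.findall("neighbor"):
        parent.remove(child)

# ET.indent (Python 3.9+) rewrites whitespace-only text and tail values in place.
ET.indent(root, space="    ")
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```

On older Python versions this call is not available, so the whitespace would have to be adjusted by hand or left as it is; the document is well-formed either way.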

Using your attribute logic, with the xml modified as follows:

x = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
           <neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

Using lxml:

import lxml.etree as et

xml = et.fromstring(x)

for node in xml.xpath("//neighbor[not(@make) and not(@job) and not(@build)]"):
    node.getparent().remove(node)
print(et.tostring(xml))

Will give you:

 <data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
        </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
        </country>
</data>

The same logic in ElementTree:

from xml.etree import ElementTree as et

xml = et.parse("test.xml").getroot()

atts = {"build", "job", "make"}

for parent in xml.findall(".//neighbor/.."):
    for child in parent.findall("./neighbor"):
        if not atts.issubset(child.attrib):
            parent.remove(child)

If you use iter:

from xml.etree import ElementTree as et

xml = et.parse("test.xml")

for parent in xml.getroot().iter("*"):
    # Rebuild each element's child list in place, leaving out the neighbor elements
    parent[:] = [child for child in parent if child.tag != "neighbor"]
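The slice-assignment idea generalizes to any set of unwanted tags. Here is a minimal sketch, assuming only the standard library; `drop_tags` and `SAMPLE` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET


def drop_tags(root, unwanted):
    """Remove every descendant of root whose tag is in unwanted (hypothetical helper)."""
    for parent in root.iter():
        # Build the filtered list first, then replace the children in one step,
        # so no element is removed while its siblings are being looped over.
        parent[:] = [child for child in parent if child.tag not in unwanted]
    return root


# Compact, made-up document for illustration.
SAMPLE = (
    "<data><country><rank>1</rank><neighbor/>"
    "<year>2008</year><neighbor/></country></data>"
)
root = drop_tags(ET.fromstring(SAMPLE), {"neighbor", "year"})
result = ET.tostring(root, encoding="unicode")
print(result)
```

A removed element takes its whole subtree and its tail text with it, so this is only appropriate when nothing beneath or directly after the dropped elements needs to be kept.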

You can see we get exactly the same output:

In [30]: !cat /home/padraic/untitled6/test.xml
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">#
      <neighbor name="Austria" direction="E"/>
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
        <year>2008</year>
      <neighbor name="Austria" direction="E"/>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
In [31]: paste
def test():
    import lxml.etree as et
    xml = et.parse("/home/padraic/untitled6/test.xml")
    for node in xml.xpath("//neighbor"):
        node.getparent().remove(node)
    a = et.tostring(xml)
    from xml.etree import ElementTree as et
    xml = et.parse("/home/padraic/untitled6/test.xml")
    for parent in xml.getroot().iter("*"):
        parent[:] = (child for child in parent if child.tag != "neighbor")
    b = et.tostring(xml.getroot())
    assert  a == b

## -- End pasted text --

In [32]: test()

Answer 1 (score: 1)

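Whenever you need to modify XML documents, also consider XSLT, the special-purpose language in the XSL family, which also includes XPath. XSLT is designed specifically for transforming XML files. Python users are not quick to recommend it, but it avoids the need for loops or nested if/then logic in general-purpose code. Python's lxml module can run XSLT 1.0 scripts using the libxslt processor.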

The transformation below runs the identity transform to copy the document as is, then applies an empty template match on <neighbor> to remove it:

XSLT script (save as a .xsl file; it is loaded just like the source .xml, since both are well-formed xml files)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <!-- IDENTITY TRANSFORM TO COPY XML AS IS -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- EMPTY TEMPLATE TO REMOVE NEIGHBOR WHEREVER IT EXISTS -->  
  <xsl:template match="neighbor"/>

</xsl:transform>

Python script

import lxml.etree as et

# LOAD XML AND XSL DOCUMENTS
xml  = et.parse("Input.xml")
xslt = et.parse("Script.xsl")

# TRANSFORM TO NEW TREE
transform = et.XSLT(xslt)
newdom = transform(xml)

# CONVERT TO STRING
tree_out = et.tostring(newdom, encoding='UTF-8', pretty_print=True,  xml_declaration=True)

# OUTPUT TO FILE
xmlfile = open('Output.xml', 'wb')
xmlfile.write(tree_out)
xmlfile.close()

Answer 2 (score: 1)

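The trick here is to find the parent (the country node) and remove the neighbor from there. In this example I am using ElementTree because I am somewhat familiar with it: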

import xml.etree.ElementTree as ET

if __name__ == '__main__':
    with open('debug.log') as f:
        doc = ET.parse(f)

        for country in doc.findall('.//country'):
            for neighbor in country.findall('neighbor'):
                country.remove(neighbor)

        ET.dump(doc)  # Display