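I want to remove elements with a certain tag value and then write out the .xml file without any tags for the removed elements. Is creating a new tree my only option?
There are two options for removing/deleting an element:
clear() Resets an element. This function removes all subelements, clears all attributes, and sets the text and tail attributes to None.
At first I used this, and it does remove the data from the element, but I am still left with an empty element: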
# Remove all elements from the tree that are NOT "job" or "make" or "build" elements
log = open("debug.log", "w")
for el in root.iter("*"):
    if el.tag != "job" and el.tag != "make" and el.tag != "build":
        print("removed = ", el.tag, el.attrib, file=log)
        el.clear()
    else:
        print("NOT", el.tag, el.attrib, file=log)
log.close()
tree.write("make_and_job_tree.xml", short_empty_elements=False)
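The problem is that xml.etree.ElementTree.ElementTree.write() still writes out empty tags no matter what:
...The keyword-only short_empty_elements parameter controls the formatting of elements that contain no content. If True (the default), they are emitted as a single self-closed tag, otherwise they are emitted as a pair of start/end tags.
Why isn't there an option to just not print those empty tags! Whatever.
So then I thought I might try
remove(subelement) Removes subelement from the element. Unlike the find* methods, this method compares elements based on instance identity, not on tag value or contents.
But this only operates on child elements.
So I would have to do something like: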
for el in root.iter("*"):
    for subel in el:
        if subel.tag != "make" and subel.tag != "job" and subel.tag != "build":
            el.remove(subel)
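But there is a big problem here: I am invalidating the iterator by removing elements, right?
Is it enough to simply check whether the element is empty by adding if subel?:
if subel and subel.tag != "make" and subel.tag != "job" and subel.tag != "build":
Or do I have to get a new iterator every time I invalidate the tree elements?
Remember: I just want to write out the xml file with no tags for the empty elements.
Here's an example.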
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>
Suppose I want to remove any mention of neighbor.
Ideally, I would want this output after the removal:
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
</country>
</data>
The problem is that when I run the code using clear() (see the first code block above) and write it to a file, I get this:
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor></neighbor><neighbor></neighbor></country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor></neighbor></country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor></neighbor><neighbor></neighbor></country>
</data>
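Notice that neighbor still appears.
I know I could easily run a regex over the output, but there has to be a way (or another Python API) that does this on the fly, instead of requiring me to touch my .xml file again.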
Answer 0: (score: 3)
import lxml.etree as et
xml = et.parse("test.xml")
for node in xml.xpath("//neighbor"):
    node.getparent().remove(node)
xml.write("out.xml", encoding="utf-8", xml_declaration=True)
Using ElementTree, we need to find the parents of the neighbor nodes, then find the neighbor nodes inside each parent and remove them:
from xml.etree import ElementTree as et
xml = et.parse("test.xml")
for parent in xml.getroot().findall(".//neighbor/.."):
    for child in parent.findall("./neighbor"):
        parent.remove(child)
xml.write("out.xml", encoding="utf-8", xml_declaration=True)
Both will give you:
<?xml version='1.0' encoding='utf-8'?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
</country>
</data>
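On the iterator worry raised in the question: removing children while looping over the live element does skip siblings, much like mutating a list during iteration. The following is a minimal sketch of that effect, assuming a small inline document (the SAMPLE string and the variable names are made up for illustration). Looping over a snapshot, such as list(parent) or the list returned by findall(), avoids the problem.

```python
from xml.etree import ElementTree as ET

# Hypothetical inline sample, trimmed from the question's document.
SAMPLE = """<country name="Liechtenstein">
<rank>1</rank>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>"""

# Unsafe: removing from the element being iterated shifts the remaining
# children left, so the sibling after a removed node is never visited.
unsafe = ET.fromstring(SAMPLE)
for child in unsafe:
    if child.tag == "neighbor":
        unsafe.remove(child)
unsafe_left = [c.tag for c in unsafe]

# Safe: iterate over a snapshot of the children, mutate the real element.
safe = ET.fromstring(SAMPLE)
for child in list(safe):
    if child.tag == "neighbor":
        safe.remove(child)
safe_left = [c.tag for c in safe]

print(unsafe_left)  # one neighbor survives
print(safe_left)    # only rank remains
```

Adding "if subel" would not help here: the skipped sibling is never handed to the loop body at all, so there is nothing to test.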
Using your attribute logic, and modifying the xml a bit as follows:
x = """<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>"""
Using lxml:
import lxml.etree as et
xml = et.fromstring(x)
# Remove every neighbor that has none of the make, job, or build attributes
for node in xml.xpath("//neighbor[not(@make) and not(@job) and not(@build)]"):
    node.getparent().remove(node)
print(et.tostring(xml))
Which will give you:
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
</country>
</data>
The same logic in ElementTree:
from xml.etree import ElementTree as et
xml = et.parse("test.xml").getroot()
atts = {"build", "job", "make"}
for parent in xml.findall(".//neighbor/.."):
    # "./neighbor" matches direct children only, which is what remove() needs
    for child in parent.findall("./neighbor")[:]:
        if not atts.issubset(child.attrib):
            parent.remove(child)
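One subtlety worth spelling out: the issubset test removes a neighbor that lacks any one of the three attributes, while the XPath predicate in the lxml version removes only a neighbor that has none of them. On the sample document the two agree, because each neighbor has either all three or none. A minimal sketch of the difference, assuming an extra made-up neighbor named "Partial" that carries only make (SAMPLE, REQUIRED, and surviving_names are illustrative names, not part of the answer):

```python
from xml.etree import ElementTree as ET

# Hypothetical sample with one neighbor that has only some of the attributes.
SAMPLE = """<data>
<country name="Singapore">
<rank>4</rank>
<neighbor name="Costa Rica" direction="W" make="foo" build="bar" job="blah"/>
<neighbor name="Malaysia" direction="N"/>
<neighbor name="Partial" make="foo"/>
</country>
</data>"""

REQUIRED = {"build", "job", "make"}

def surviving_names(should_drop):
    """Apply a removal rule to a fresh copy and list the neighbors left."""
    root = ET.fromstring(SAMPLE)
    for parent in root.findall(".//neighbor/.."):
        for child in parent.findall("./neighbor"):
            if should_drop(child):
                parent.remove(child)
    return [n.get("name") for n in root.iter("neighbor")]

# Rule used by the ElementTree code: drop if any required attribute is missing.
strict = surviving_names(lambda el: not REQUIRED.issubset(el.attrib))
# Rule expressed by the XPath predicate: drop only if none is present.
loose = surviving_names(lambda el: REQUIRED.isdisjoint(el.attrib))

print(strict)
print(loose)
```

Which rule is wanted depends on the real data; the question does not say how partially attributed elements should be treated.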
If you use iter:
from xml.etree import ElementTree as et
xml = et.parse("test.xml")
for parent in xml.getroot().iter("*"):
    parent[:] = (child for child in parent if child.tag != "neighbor")
You can see we get the exact same output:
In [30]: !cat /home/padraic/untitled6/test.xml
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">#
<neighbor name="Austria" direction="E"/>
<rank>1</rank>
<neighbor name="Austria" direction="E"/>
<year>2008</year>
<neighbor name="Austria" direction="E"/>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>
In [31]: paste
def test():
    import lxml.etree as et
    xml = et.parse("/home/padraic/untitled6/test.xml")
    for node in xml.xpath("//neighbor"):
        node.getparent().remove(node)
    a = et.tostring(xml)
    from xml.etree import ElementTree as et
    xml = et.parse("/home/padraic/untitled6/test.xml")
    for parent in xml.getroot().iter("*"):
        parent[:] = (child for child in parent if child.tag != "neighbor")
    b = et.tostring(xml.getroot())
    assert a == b
## -- End pasted text --
In [32]: test()
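The slice-assignment approach can also be tried without any files on disk. This is a minimal sketch, assuming a small inline document (SAMPLE and the variable names are made up here). It snapshots the traversal with list() before rewriting each child list, which is a cautious choice on my part rather than something the code above requires.

```python
from xml.etree import ElementTree as ET

# Hypothetical inline sample, trimmed from the question's document.
SAMPLE = """<data>
<country name="Panama">
<rank>68</rank>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>"""

root = ET.fromstring(SAMPLE)

# Visit every element once and rebuild its child list without neighbor nodes.
for parent in list(root.iter()):
    parent[:] = [child for child in parent if child.tag != "neighbor"]

result = ET.tostring(root, encoding="unicode")
remaining_tags = [el.tag for el in root.iter()]
print(result)
```

Because the whole child list is replaced in one step, there is no element-by-element removal inside the loop and therefore nothing to skip.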
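Answer 1: (score: 1)
Whenever you need to modify XML documents, also consider XSLT, the special-purpose language that is part of the XSL family, which includes XPath. XSLT is designed specifically to transform XML files. Pythoners are not quick to recommend it, but it avoids the need for loops or nested if/then logic in general-purpose code. Python's lxml module can run XSLT 1.0 scripts using the libxslt processor.
The transformation below runs the identity transform to copy the document as is, then runs an empty template match on <neighbor> to remove it:
XSLT Script (save as an .xsl file and load it just like the source .xml, since both are well-formed xml files)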
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- IDENTITY TRANSFORM TO COPY XML AS IS -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- EMPTY TEMPLATE TO REMOVE NEIGHBOR WHEREVER IT EXISTS -->
<xsl:template match="neighbor"/>
</xsl:transform>
Python Script
import lxml.etree as et
# LOAD XML AND XSL DOCUMENTS
xml = et.parse("Input.xml")
xslt = et.parse("Script.xsl")
# TRANSFORM TO NEW TREE
transform = et.XSLT(xslt)
newdom = transform(xml)
# CONVERT TO STRING
tree_out = et.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
# OUTPUT TO FILE
xmlfile = open('Output.xml', 'wb')
xmlfile.write(tree_out)
xmlfile.close()
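Answer 2: (score: 1)
The trick here is to find the parent (the country node) and remove the neighbor from there. In this example I am using ElementTree because I am somewhat familiar with it: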
import xml.etree.ElementTree as ET
if __name__ == '__main__':
    with open('debug.log') as f:
        doc = ET.parse(f)
    for country in doc.findall('.//country'):
        for neighbor in country.findall('neighbor'):
            country.remove(neighbor)
    ET.dump(doc)  # Display
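The same parent-then-child idea can be generalized to any set of tags. Below is a minimal sketch, assuming a hypothetical helper named remove_tags (not part of ElementTree) and a one-line inline sample; it walks a snapshot of the tree so that removals cannot disturb the traversal.

```python
import xml.etree.ElementTree as ET

def remove_tags(root, tags):
    """Remove every descendant of root whose tag is in tags; return the count."""
    removed = 0
    # Snapshot both the traversal and each child list before mutating.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in tags:
                parent.remove(child)
                removed += 1
    return removed

# Hypothetical sample written on one line so whitespace does not matter.
SAMPLE = (
    '<data><country name="Singapore"><rank>4</rank><year>2011</year>'
    '<neighbor name="Malaysia" direction="N"/></country></data>'
)

root = ET.fromstring(SAMPLE)
count = remove_tags(root, {"neighbor", "year"})
output = ET.tostring(root, encoding="unicode")
print(count, output)
```

Note that remove() drops the element together with its tail text, so in an indented document the whitespace after a removed element disappears with it and the layout of the closing tags may shift slightly.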