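Completely removing a given element from an XML document is easy with lxml's ElementTree API, but I can't see a simple way to consistently replace an element with some text. For example, given the following input: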
data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
...you can easily remove every <r> element with:
from lxml import etree
f = etree.fromstring(data)
for r in f.xpath('//r'):
    # note: in lxml, remove() also drops the removed element's tail text
    r.getparent().remove(r)
print(etree.tostring(f, pretty_print=True).decode())
But how would you replace each <r> element with text, to get this output:
<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
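It seems to me that because the ElementTree API handles text through each element's .text and .tail attributes rather than as nodes in the tree, you have to handle many different cases, depending on whether the element has siblings, whether the preceding element already has a .tail, and so on. Am I missing some simple way to do this?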
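To illustrate the .text / .tail model described above, here is a minimal sketch that inspects the last <m> element of the sample input. The variable names (m, first_b, last_b) are chosen only for this illustration.

```python
from lxml import etree

# The last <m> element from the sample input, parsed on its own.
m = etree.fromstring(
    '<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>'
)

first_b, r, last_b = list(m)

print(repr(m.text))        # None: no text before the first child
print(repr(first_b.tail))  # ' Text after a sibling '
print(repr(r.tail))        # ' Text before a sibling'
print(repr(last_b.tail))   # None: no text after the last child
```

The text that follows <r/> belongs to r itself, so removing r with remove() takes that text along, and any replacement text has to be attached either to the previous sibling's .tail or to the parent's .text.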
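Answer 0 (score: 16)
I think unutbu's XSLT solution is probably the right way to achieve your goal.
However, here is a somewhat hacky way to do it: modify the tail of each <r/> tag, then use etree.strip_elements.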
from lxml import etree
data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
f = etree.fromstring(data)
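for r in f.xpath('//r'):
    # prepend the replacement text to whatever follows the element
    r.tail = 'DELETED' + (r.tail or '')
# with_tail=False removes the <r> elements but leaves their tail text in place
etree.strip_elements(f, 'r', with_tail=False)
print(etree.tostring(f, pretty_print=True).decode())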
This gives you:
<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
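Answer 1 (score: 7)
The downside of using strip_elements is that you cannot keep some <r> elements while replacing others. It also requires an ElementTree instance to exist (which may not be the case). Finally, you cannot use it to replace XML comments or processing instructions.
The following should do the job: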
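for r in f.xpath('//r'):
    # r.tail is None when nothing follows the element, so fall back to ''
    text = 'DELETED' + (r.tail or '')
    parent = r.getparent()
    if parent is not None:
        previous = r.getprevious()
        if previous is not None:
            # append to the text that follows the previous sibling
            previous.tail = (previous.tail or '') + text
        else:
            # no previous sibling: append to the parent's leading text
            parent.text = (parent.text or '') + text
        parent.remove(r)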
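The same logic can be wrapped in a reusable function so that only selected elements get replaced. This is a minimal sketch: the name replace_with_text, its signature, and the sample document doc are assumptions made for illustration, not part of lxml.

```python
from lxml import etree


def replace_with_text(element, text):
    """Replace `element` with `text`, keeping the element's own tail.

    Hypothetical helper, not an lxml API.
    """
    parent = element.getparent()
    if parent is None:
        raise ValueError('cannot replace the root element with text')
    new_text = text + (element.tail or '')
    previous = element.getprevious()
    if previous is not None:
        # Append to the text that follows the previous sibling.
        previous.tail = (previous.tail or '') + new_text
    else:
        # No previous sibling: append to the parent's leading text.
        parent.text = (parent.text or '') + new_text
    parent.remove(element)


doc = etree.fromstring('<m>Text before <r/> and after<b/><r/></m>')
for r in doc.xpath('//r'):
    replace_with_text(r, 'DELETED')
result = etree.tostring(doc).decode()
print(result)  # <m>Text before DELETED and after<b/>DELETED</m>
```

Because the function takes a single element, the caller decides which matches to replace, for example by filtering the xpath result before the loop.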
Answer 2 (score: 3)
Using ET.XSLT:
import io
import lxml.etree as ET
data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
f=ET.fromstring(data)
xslt = '''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Replace r nodes with DELETED
       http://www.w3schools.com/xsl/el_template.asp -->
  <xsl:template match="r">DELETED</xsl:template>
  <!-- How to copy XML without changes
       http://mrhaki.blogspot.com/2008/07/copy-xml-as-is-with-xslt.html -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="@*|text()|comment()|processing-instruction()">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
'''
# io.BytesIO needs bytes, so encode the stylesheet first
xslt_doc = ET.parse(io.BytesIO(xslt.encode('utf-8')))
transform = ET.XSLT(xslt_doc)
f = transform(f)
print(ET.tostring(f).decode())
which yields:
<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>