Question

My task is to do a minor refactoring of certain elements of an XML tree in Python 3, namely to replace the following structure:

<span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/
  <sup>
   <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/>
  </sup>
 </a>
</span>

with:

<span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/
 </a>
</span>

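I.e., remove the sup element only if the whole structure corresponds exactly to the one given in the first example. I need to keep the XML document intact during processing, so regexp matching is not really an option.

I already have code that works for my purposes: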

doc = self.__refactor_links(doc)
...
def __refactor_links(self, node):
    """Recursively seeks for links to refactor them"""
    for span in node.childNodes:
        replace = False
        if isinstance(span, xml.dom.minidom.Element):
            if span.tagName == "span" and span.getAttribute("class") == "nobr":
                if span.childNodes.length == 1:
                    a = span.childNodes.item(0)
                    if isinstance(a, xml.dom.minidom.Element):
                        if a.tagName == "a" and a.getAttribute("href"):
                            if a.childNodes.length == 2:
                                aurl = a.childNodes.item(0)
                                if isinstance(aurl, xml.dom.minidom.Text):
                                    sup = a.childNodes.item(1)
                                    if isinstance(sup, xml.dom.minidom.Element):
                                        if sup.tagName == "sup":
                                            if sup.childNodes.length == 1:
                                                img = sup.childNodes.item(0)
                                                if isinstance(img, xml.dom.minidom.Element):
                                                    if img.tagName == "img" and img.getAttribute("class") == "rendericon":
                                                        replace = True
            else:
                self.__refactor_links(span)
        if replace:
            a.removeChild(sup)
    return node

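This code does not recurse through all the tags: once it hits something that resembles the structure it is looking for, it does not keep searching inside those elements, even if the match fails. In my case I don't need it to (doing so would be nice, but adding a bunch of else: self.__refactor_links(tag) branches would kill it in my eyes).

If any condition fails, no removal should happen. Is there a cleaner way to define a set of conditions that avoids the huge pile of "ifs"? Some custom data structure could be used to store the conditions, e.g. ('sup', ('img', (...))), but I don't know how it should be processed. If you have any suggestions or examples in Python, please help.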

Thanks.
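The nested-tuple idea from the question could be processed roughly as follows. This is only a sketch, assuming a hypothetical pattern format of (tag, required attributes, child patterns); the names PATTERN, matches, refactor_links and the "#text" marker are made up for illustration and are not part of minidom. Like the original code, it assumes the document contains no whitespace-only text nodes between the elements.

```python
from xml.dom import minidom

# Hypothetical pattern format: (tag, required_attributes, child_patterns).
# The string "#text" stands for a text node; a value of None means
# "the attribute must be present, with any value".
PATTERN = ("span", {"class": "nobr"}, [
    ("a", {"href": None}, [
        "#text",
        ("sup", {}, [
            ("img", {"class": "rendericon"}, []),
        ]),
    ]),
])


def matches(node, pattern):
    """Return True if node has exactly the shape described by pattern."""
    if pattern == "#text":
        return node.nodeType == node.TEXT_NODE
    tag, attrs, children = pattern
    if node.nodeType != node.ELEMENT_NODE or node.tagName != tag:
        return False
    for name, value in attrs.items():
        if not node.hasAttribute(name):
            return False
        if value is not None and node.getAttribute(name) != value:
            return False
    # The number and order of child nodes must match the pattern exactly.
    kids = node.childNodes
    return len(kids) == len(children) and all(
        matches(kid, sub) for kid, sub in zip(kids, children)
    )


def refactor_links(node):
    """Remove the sup from every subtree matching PATTERN, recursing elsewhere."""
    for child in list(node.childNodes):
        if matches(child, PATTERN):
            a = child.firstChild          # the <a> element
            a.removeChild(a.lastChild)    # the <sup> element
        else:
            refactor_links(child)
    return node
```

Since a failed match simply falls through to the recursive call, this version also keeps searching inside elements that only partially resemble the pattern, without extra else branches.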

Answer 1

This is definitely a task for XPath expressions, in your case probably used together with lxml.

The XPath could look something like this:

//span[@class="nobr"]/a[@href]/sup[img/@class="rendericon"]

Match your tree against this XPath expression and remove all matching elements. No endless if constructs or recursion required.
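As a rough illustration of that approach using only the standard library, here is a sketch with xml.etree.ElementTree. Its XPath subset does not support the nested img/@class predicate, so that one check is done in Python instead; the function name strip_render_icons is made up for this example. With lxml, the full expression above could presumably be passed to xpath() directly, as the lxml-based answer in this thread does.

```python
import xml.etree.ElementTree as ET


def strip_render_icons(root):
    """Remove <sup> elements matching span[@class='nobr']/a[@href]/sup
    that contain an <img class="rendericon"> child. Returns the count removed."""
    removed = 0
    # findall() returns a list, so removing children while looping is safe.
    for a in root.findall(".//span[@class='nobr']/a[@href]"):
        for sup in a.findall("sup"):
            if sup.find("img[@class='rendericon']") is not None:
                a.remove(sup)  # note: this also drops sup's tail text
                removed += 1
    return removed
```

Note that ".//span" only searches below the element it is called on, so the span to be rewritten has to be a descendant of root rather than root itself.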

Answer 2

I'm not great with XML, but can't you just use find/search on the nodes?

>>> from xml.dom.minidom import parse, parseString
>>> dom = parseString(x)
>>> k = dom.getElementsByTagName('sup')
>>> for l in k:
...     p = l.parentNode
...     p.removeChild(l)
... 
<DOM Element: sup at 0x100587d40>
>>> 
>>> print(dom.toxml())
<?xml version="1.0" ?><span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/

 </a>
</span>
>>>

Answer 3

Here's something quick with lxml. I highly recommend XPath.

>>> from lxml import etree
>>> doc = etree.XML("""<span class="nobr">
...  <a href="http://www.google.com/">
...   http://www.google.com/
...   <sup>
...    <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/>
...   </sup>
...  </a>
... </span>""")
>>> for a in doc.xpath('//span[@class="nobr"]/a[@href="http://www.google.com/"]'):
...     for sub in list(a):
...         a.remove(sub)
...
>>> print(etree.tostring(doc, pretty_print=True).decode())
<span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/
  </a>
</span>

Answer 4

Easily done with lxml and XSLT:

>>> from lxml import etree
>>> from io import StringIO
>>> # create the stylesheet
>>> xslt = StringIO("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- this is the standard identity transform -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- this replaces the specific node you're looking to replace -->
  <xsl:template match="span[a[@href='http://www.google.com/' and 
                    sup[img[
                      @align='absmiddle' and
                      @border='0' and
                      @class='rendericon' and
                      @height='7' and
                      @src='http://jira.atlassian.com/icon.gif' and
                      @width='7']]]]">
    <span class="nobr">
      <a href="http://www.google.com/">http://www.google.com/</a>
    </span>
  </xsl:template>
</xsl:stylesheet>""")
>>> # create a transform function from the XSLT stylesheet
>>> transform = etree.XSLT(etree.parse(xslt))
>>> # here's a sample source XML instance for testing
>>> source = StringIO("""
<test>
  <span class="nobr">
   <a href="http://www.google.com/">
    http://www.google.com/
    <sup>
     <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/>
    </sup>
   </a>
  </span>
</test>""")
>>> # parse the source, transform it to an XSLT result tree, and print the result
>>> print(etree.tostring(transform(etree.parse(source))).decode())
<test>
  <span class="nobr"><a href="http://www.google.com/">http://www.google.com/</a></span>
</test>

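Edit

I should note that none of the answers (not mine, not MattH's, and certainly not the example the OP posted) does what the OP asked for, which is to replace only those elements whose structure exactly matches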

<span class="nobr"> <a href="http://www.google.com/"> http://www.google.com/ <sup> <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/> </sup> </a> </span>

All of these examples will replace the sup if, for instance, the img has a style attribute, or if the sup has another child besides the img.

It's possible to construct an XPath expression that is much stricter about what it matches. For instance, instead of using

span[a]

which matches any span with at least one a child, you could use

span[count(@*)=0 and count(*)=1 and a]

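which matches any span that has no attributes and exactly one child element, where that child is an a. You can go pretty crazy in pursuit of exactness, e.g.: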

span[count(@*) = 1 and @class='nobr' and count(*) = 1 and a[count(@*) = 1 and @href='http://www.google.com/' and count(*) = 1 and sup[count(@*) = 0 and count(*) = 1 and img[count(*) = 0 and count(@*) = 7 and @align='absmiddle' and @alt='' and @border='0' and @class='rendericon' and @height='7' and @src='http://jira.atlassian.com/icon.gif' and @width='7']]]]

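which, at every step of the match, makes sure that the matched element contains only the specified attributes and elements and nothing more. (It still doesn't verify that the elements contain no text, comments, or processing instructions; if you're really serious about it, use count(node()) everywhere count(*) appears.)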
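For comparison, the same strictness can be spelled out as plain Python checks on a standard-library ElementTree tree, which has no count() function in its XPath subset. This is only a sketch; the helper names only_child and is_exact_match and the constant IMG_ATTRS are made up for illustration. Like count(*), it counts child elements only and does not inspect text content.

```python
import xml.etree.ElementTree as ET

# The exact attribute set the img element is expected to carry.
IMG_ATTRS = {
    "align": "absmiddle", "alt": "", "border": "0", "class": "rendericon",
    "height": "7", "src": "http://jira.atlassian.com/icon.gif", "width": "7",
}


def only_child(parent, tag):
    """Return parent's single child element if it is a `tag`, else None."""
    if len(parent) == 1 and parent[0].tag == tag:
        return parent[0]
    return None


def is_exact_match(span):
    """Check the span > a > sup > img shape, allowing no extra attributes
    or child elements at any level."""
    if span.tag != "span" or span.attrib != {"class": "nobr"}:
        return False
    a = only_child(span, "a")
    if a is None or a.attrib != {"href": "http://www.google.com/"}:
        return False
    sup = only_child(a, "sup")
    if sup is None or sup.attrib:
        return False
    img = only_child(sup, "img")
    return img is not None and len(img) == 0 and img.attrib == IMG_ATTRS
```

Comparing attrib against a whole dictionary covers both the count(@*) test and the individual attribute tests in one step.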

Conditionally removing elements from an XML document tree
