Question

I'm trying to find a way to clean up a bunch of empty DOM elements from HTML source like this:

<div class="empty">
    <div>&nbsp;</div>
    <div></div>
</div>
<a href="http://example.com">good</a>
<div>
    <p></p>
</div>
<br>
<img src="http://example.com/logo.png" />
<div></div>

However, I don't want to harm valid elements or line breaks. So the result should look like this:

<a href="http://example.com">good</a>
<br>
<img src="http://example.com/logo.png" />

So far I have tried some XPath expressions like these:

$xpath = new DOMXPath($dom);

//$x = '//*[not(*) and not(normalize-space(.))]';
//$x = '//*[not(text() or node() or self::br)]';
//$x = 'not(normalize-space(.) or self::br)';
$x = '//*[not(text() or node() or self::br)]';

while(($nodeList = $xpath->query($x)) && $nodeList->length > 0) {
    foreach ($nodeList as $node) {
        $node->parentNode->removeChild($node);
    }
}

Can someone tell me the correct XPath to remove empty DOM nodes? (img, br, and input are useful even when they are empty.)

Current output:

<div>
    <div>&nbsp;</div>

</div>
<a href="http://example.com">good</a>
<div>

</div>
<br>

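Update

To clarify, I'm looking for an XPath query that either:

Matches empty nodes recursively until all of them are found (including the parents of empty nodes)
Can be run successfully multiple times, once after each cleanup (as in my example)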

Answer 1

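I. Initial solution:

XPath is a query language for XML documents. As such, evaluating an XPath expression only selects nodes or extracts non-node information from an XML document, but it never alters the XML document. Evaluating an XPath expression therefore never deletes or inserts nodes: the XML document remains the same.

What you want is to "clean up a bunch of empty DOM elements from HTML source", and this cannot be done with XPath alone.

This is confirmed by the most credible and the only official (we say normative) source on XPath, the W3C XPath 1.0 Recommendation:

"The primary purpose of XPath is to address parts of an XML [XML] document. In support of this primary purpose, it also provides basic facilities for manipulation of strings, numbers and booleans. XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document."

Therefore, some other language must be used in conjunction with XPath in order to implement the required functionality.

XSLT is a language designed specifically for XML transformations.

Here is an XSLT-based example, a short XSLT transformation that performs the requested cleanup: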

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match=
 "*[not(string(translate(., '&#xA0;', '')))
  and
    not(descendant-or-self::*
          [self::img or self::input or self::br])]"/>
</xsl:stylesheet>

When applied to the provided XML (corrected so that it is a well-formed XML document):

<html>
    <div class="empty">
        <div>&#xA0;</div>
        <div></div>
    </div>
    <a href="http://example.com">good</a>
    <div>
        <p></p>
    </div>
    <br />
    <img src="http://example.com/logo.png" />
    <div></div>
</html>

the wanted, correct result is produced:

<html>
   <a href="http://example.com">good</a>
   <br/>
   <img src="http://example.com/logo.png"/>
</html>

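Explanation:

The identity rule copies "as-is" every node for which it is selected for execution.
There is a single template that overrides the identity template for any element (other than img, input, and br, or an element containing one of them) whose string value, after every &nbsp; character has been removed from it, is the empty string. The body of this template is empty, which effectively "deletes" the matched element: the matched element is not copied to the output.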

II. Update:

The OP has clarified that he wants one or more XPath expressions that:

"Can be run successfully multiple times, once after each cleanup."

Interestingly, there is a single XPath expression that selects exactly the nodes that need to be deleted, so the "multiple cleanups" are avoided completely:

//*[not(normalize-space((translate(., '&#xA0;', ''))))
  and
    not(descendant-or-self::*[self::img or self::input or self::br])
    ]
     [not(ancestor::*
             [count(.| //*[not(normalize-space((translate(., '&#xA0;', ''))))
                         and
                           not(descendant-or-self::*
                                  [self::img or self::input or self::br])
                          ]
                    )
             =
              count(//*[not(normalize-space((translate(., '&#xA0;', ''))))
                      and
                        not(descendant-or-self::*
                                 [self::img or self::input or self::br])
                        ]
                   )
              ]
          )
     ]

XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match=
   "//*[not(normalize-space((translate(., '&#xA0;', ''))))
      and
        not(descendant-or-self::*[self::img or self::input or self::br])
       ]
        [not(ancestor::*
               [count(.| //*[not(normalize-space((translate(., '&#xA0;', ''))))
                           and
                             not(descendant-or-self::*
                                    [self::img or self::input or self::br])
                             ]
                      )
               =
                count(//*[not(normalize-space((translate(., '&#xA0;', ''))))
                        and
                          not(descendant-or-self::*
                                 [self::img or self::input or self::br])
                          ]
                      )
               ]
            )
        ]
 "/>
</xsl:stylesheet>

When this transformation is applied to the provided (and made well-formed) XML document (above), all nodes are copied "as-is" except the nodes selected by our XPath expression:

<html>
   <a href="http://example.com">good</a>
   <br/>
   <img src="http://example.com/logo.png"/>
</html>

Explanation:

Let us denote by $vAllEmpty all the nodes that are "empty" according to the definition of "empty" in the question.

$vAllEmpty is expressed by the following XPath expression:

//*[not(normalize-space((translate(., '&#xA0;', ''))))
  and
    not(descendant-or-self::*[self::img or self::input or self::br])
   ]

To delete all of these, we only need to delete the "top nodes" from $vAllEmpty.

Let us denote the set of all these "top nodes" as $vTopEmpty.

$vTopEmpty can be expressed in terms of $vAllEmpty with the following XPath 2.0 expression:

$vAllEmpty[not(ancestor::* intersect $vAllEmpty)]

This selects from $vAllEmpty those nodes that do not have an ancestor element that is also in $vAllEmpty.

The last XPath expression has an equivalent XPath 1.0 expression:

$vAllEmpty[not(ancestor::*[count(.|$vAllEmpty) = count($vAllEmpty)])]

Now we replace $vAllEmpty in the last expression with its expanded XPath expression as defined above, and this is how we obtain the final expression, which selects only the "top nodes" to delete:

//*[not(normalize-space((translate(., '&#xA0;', ''))))
  and
    not(descendant-or-self::*[self::img or self::input or self::br])
    ]
     [not(ancestor::*
             [count(.| //*[not(normalize-space((translate(., '&#xA0;', ''))))
                         and
                           not(descendant-or-self::*
                                  [self::img or self::input or self::br])
                          ]
                    )
             =
              count(//*[not(normalize-space((translate(., '&#xA0;', ''))))
                      and
                        not(descendant-or-self::*
                                 [self::img or self::input or self::br])
                        ]
                   )
              ]
          )
     ]

A short XSLT 2.0-based verification using variables:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vAllEmpty" select=
  "//*[not(normalize-space((translate(., '&#xA0;', ''))))
     and
       not(descendant-or-self::*
              [self::img or self::input or self::br])
      ]"/>

 <xsl:variable name="vTopEmpty" select=
  "$vAllEmpty[not(ancestor::* intersect $vAllEmpty)]"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="*[. intersect $vTopEmpty]"/>
</xsl:stylesheet>

This transformation copies every node "as-is", except any node that belongs to $vTopEmpty. The result is the correct and expected one:

<html>
   <a href="http://example.com">good</a>
   <br/>
   <img src="http://example.com/logo.png"/>
</html>

III. Alternative solution (may require "multiple cleanups"):

An alternative approach is not to try to specify the nodes to delete, but to specify the nodes to keep. The nodes to delete are then the set difference between all nodes and the nodes to keep.

The nodes to keep are selected by this XPath expression:

  //node()
    [self::input or self::img or self::br
    or
     self::text()[normalize-space(translate(.,'&#xA0;',''))]
    ]
     /ancestor-or-self::node()

Then the nodes to delete are:

  //node()
    [not(count(.
             |
              //node()
                [self::input or self::img or self::br
                or
                 self::text()[normalize-space(translate(.,'&#xA0;',''))]
                ]
                 /ancestor-or-self::node()
              )
        =
         count(//node()
                [self::input or self::img or self::br
                or
                 self::text()[normalize-space(translate(.,'&#xA0;',''))]
                ]
                 /ancestor-or-self::node()
              )
        )
    ]

However, do note that these are all the nodes to delete, not only the "top nodes to delete". It is possible to express only the "top nodes to delete", but the resulting expression is quite complicated. If one attempts to delete every node in this set, there will be errors, because the descendants of the "top nodes to delete" follow them in document order.
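Since the question uses PHP's DOMXPath, here is a minimal sketch of how the one-pass XPath 1.0 expression from part II might be applied there. The function name removeTopEmpty is hypothetical, the input is assumed to be well-formed XML loaded with loadXML, and the non-breaking space is passed as a literal U+00A0 character because the &#xA0; reference is only meaningful inside an XML document, not inside a PHP string.

```php
<?php
// Hypothetical helper: removes only the "top" empty elements in a single pass.
// Returns the number of removed nodes.
function removeTopEmpty(DOMDocument $dom): int
{
    $xpath = new DOMXPath($dom);
    $nbsp  = "\u{A0}"; // literal non-breaking space (PHP 7+ escape syntax)

    // $vAllEmpty from the answer: empty elements that contain no img/input/br.
    $allEmpty = "//*[not(normalize-space(translate(., '$nbsp', '')))"
              . " and not(descendant-or-self::*[self::img or self::input or self::br])]";

    // $vTopEmpty from the answer: members of $vAllEmpty with no ancestor in $vAllEmpty.
    $topEmpty = $allEmpty
              . "[not(ancestor::*[count(. | $allEmpty) = count($allEmpty)])]";

    $removed = 0;
    // Copy the node list to an array before modifying the tree.
    foreach (iterator_to_array($xpath->query($topEmpty)) as $node) {
        $node->parentNode->removeChild($node);
        $removed++;
    }
    return $removed;
}

$xml = <<<'XML'
<html>
    <div class="empty">
        <div>&#xA0;</div>
        <div></div>
    </div>
    <a href="http://example.com">good</a>
    <div>
        <p></p>
    </div>
    <br />
    <img src="http://example.com/logo.png" />
    <div></div>
</html>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
removeTopEmpty($dom);
echo $dom->saveXML($dom->documentElement), "\n";
```

Because only the top nodes are removed, no node is detached after its parent has already gone, and the while loop from the question is not needed. Whitespace-only text nodes between the remaining elements are left in place.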

Answer 2

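So you want the text nodes, <br> and <img>, and their ancestors?

You can get all br and img elements with //br and //img.

You can get all text nodes with //text(), and all non-empty text nodes with //text()[normalize-space()]. (Although you might need something like //text()[normalize-space(translate(., '&#xA0;', ''))] to filter out the &nbsp; text nodes, if your XML parser does not already do that.)

You can get all of their parents with ancestor-or-self::*.

The resulting expression is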

//br/ancestor-or-self::* | //img/ancestor-or-self::* | //text()[normalize-space()]/ancestor-or-self::*

Or, shorter, in XPath 2:

(//br | //img | //text()[normalize-space()])/ancestor-or-self::*
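The expression above selects what to keep, so something else still has to do the deleting. A rough sketch of that step in PHP follows, assuming well-formed XML input. The function name removeAllButKept is made up for illustration, and the translate() variant is used so that &nbsp;-only text counts as empty. Node paths from getNodePath() serve as set keys and are collected before anything is removed, since removals change the paths.

```php
<?php
// Hypothetical helper: deletes every element that is not in the "keep" set.
// Returns the number of elements that were not kept.
function removeAllButKept(DOMDocument $dom): int
{
    $xpath = new DOMXPath($dom);
    $nbsp  = "\u{A0}"; // literal non-breaking space (PHP 7+ escape syntax)

    // XPath 1.0 form of the answer's expression, with &nbsp; filtered out.
    $keepExpr = "//br/ancestor-or-self::* | //img/ancestor-or-self::*"
              . " | //text()[normalize-space(translate(., '$nbsp', ''))]/ancestor-or-self::*";

    // Build the keep set, keyed by each element's path in the document.
    $keep = [];
    foreach ($xpath->query($keepExpr) as $node) {
        $keep[$node->getNodePath()] = true;
    }

    // Decide what to delete while the tree is still untouched.
    $toDelete = [];
    foreach ($xpath->query('//*') as $element) {
        if (!isset($keep[$element->getNodePath()])) {
            $toDelete[] = $element;
        }
    }

    // Detach the elements; descendants of an already detached element
    // are simply detached from that (now orphaned) parent.
    foreach ($toDelete as $element) {
        if ($element->parentNode !== null) {
            $element->parentNode->removeChild($element);
        }
    }
    return count($toDelete);
}

$dom = new DOMDocument();
$dom->loadXML('<html><div><p></p></div><a href="http://example.com">good</a><br/><div>&#xA0;</div></html>');
removeAllButKept($dom);
echo $dom->saveXML($dom->documentElement), "\n";
```

Like the expression in the answer, this keeps only br and img as "useful when empty"; adding //input/ancestor-or-self::* to the union would cover input as well.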

Answer 3

Have you tried an XPath similar to this?

*[not(*) and not(text()[normalize-space()])]

Where

not(*) = no child elements
text()[normalize-space()] = text nodes that contain non-whitespace text (wrapped in not() to invert the test)
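An expression of this shape only matches leaf elements, so it has to be re-run until nothing matches, much like the loop in the question. The sketch below shows one way that could look in PHP. Note two additions that are assumptions of this sketch and not part of the answer: br, img, and input are excluded explicitly (as written, the expression would match them too, since they have no children), and &nbsp; is stripped with translate(). The function name removeEmptyLeaves is hypothetical.

```php
<?php
// Hypothetical helper: repeatedly removes empty leaf elements.
// Returns the number of passes that actually removed something.
function removeEmptyLeaves(DOMDocument $dom): int
{
    $xpath = new DOMXPath($dom);
    $nbsp  = "\u{A0}"; // literal non-breaking space (PHP 7+ escape syntax)

    // Leaf elements with no meaningful text, excluding the void elements to keep.
    $leaf = "//*[not(*)"
          . " and not(text()[normalize-space(translate(., '$nbsp', ''))])"
          . " and not(self::br or self::img or self::input)]";

    $passes = 0;
    // Each pass can turn parents into new empty leaves, so query again.
    while (($nodes = $xpath->query($leaf)) && $nodes->length > 0) {
        foreach (iterator_to_array($nodes) as $node) {
            $node->parentNode->removeChild($node);
        }
        $passes++;
    }
    return $passes;
}

$dom = new DOMDocument();
$dom->loadXML('<html><div><p></p></div><a href="http://example.com">good</a><br/><div>&#xA0;</div></html>');
echo removeEmptyLeaves($dom), " pass(es)\n";
echo $dom->saveXML($dom->documentElement), "\n";
```

The number of passes grows with the nesting depth of the empty elements, which is the cost of matching leaves only.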

Answer 4

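The simplest way to achieve the desired result is to run a regular expression over the text. One note: you have to apply the expression several times, because it is not greedy and removes only the lowest-level empty child nodes, so to remove all empty nodes the regular expression must be called repeatedly.

Here is the solution: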

<?php
$text = '<div class="empty">
    <div>&nbsp;</div>
    <div></div>
</div>
<a href="http://example.com">good</a>
<div>
    <p></p>
</div>
<br>
<img src="http://example.com/logo.png" />
<div></div>';

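// Recursive function: repeat the replacement until the text stops changing
function recreplace($text)
{
    // Remove <div> elements that contain only whitespace, &nbsp;, or empty <p> elements
    $restext = preg_replace("/<div(.*)?>((\s|&nbsp;)*|(\s|&nbsp;)*<p>(\s|&nbsp;)*<\/p>(\s|&nbsp;)*)*<\/div>/U", '', $text);
    if ($text != $restext)
    {
        // Something was removed, so run another pass and return its result
        return recreplace($restext);
    }
    else
    {
        return $restext;
    }
}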

print recreplace($text);
?>

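This code prints the result you want. If you need to, you can edit the regular expression to add any other tags that should count as empty (like <p> </p>).

For the given example, the function calls itself twice; on the third call nothing is replaced, and that text is the result.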

XPath to recursively remove empty DOM nodes?
