Question

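I have not found a clear way to select all the nodes that lie between two anchors (<a></a> tag pairs) in an HTML file.

The first anchor has the following format: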

<a href="file://START..."></a>

The second anchor:

<a href="file://END..."></a>

I have verified that both can be selected using starts-with (note that I am using HTML Agility Pack):

HtmlNode n0 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://START')]");
HtmlNode n1 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://END')]");

With that in mind, and with my amateur XPath skills, I wrote the following expression to get all the tags between the two anchors:

html.DocumentNode.SelectNodes("//*[not(following-sibling::a[starts-with(@href,'file://START0')]) and not (preceding-sibling::a[starts-with(@href,'file://END0')])]");

This seems to work, but it selects the entire HTML document!

For example, given the following HTML fragment:

<html>
...

<a href="file://START0"></a>
<p>First nodes</p>
<p>First nodes
    <span>X</span>
</p>
<p>First nodes</p>
<a href="file://END0"></a>

...
</html>

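I need to remove the two anchors and the three P elements (including the inner SPAN, of course).

Is there any way to do this?

I don't know whether XPath 2.0 offers a better way to achieve this.

* EDIT (special case!) *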

I should also handle the following case:

“Select the tags between X and X', where X is <p><a href="file://..."></a></p>”

So instead of:

<a href="file://START..."></a>
<!-- xhtml to be extracted -->
<a href="file://END..."></a>

I should also handle:

<p>
  <a href="file://START..."></a>
</p>
<!-- xhtml to be extracted -->

<p>
  <a href="file://END..."></a>
</p>

Many thanks again.

Answer 1

Use this XPath 1.0 expression:

//a[starts-with(@href,'file://START')]/following-sibling::node()
     [count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
     =
      count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
     ]

Or use this XPath 2.0 expression:

    //a[starts-with(@href,'file://START')]/following-sibling::node()
  intersect
    //a[starts-with(@href,'file://END')]/preceding-sibling::node()

The XPath 2.0 expression uses the XPath 2.0 intersect operator.

The XPath 1.0 expression uses the Kayessian formula (named after @Michael Kay) for the intersection of two node-sets:

$ns1[count(.|$ns2) = count($ns2)]
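
To see why the count test behaves as an intersection, here is a minimal Python sketch that mimics the formula on plain lists. It is only an illustration, not XPath itself: the function name kayessian_intersection and the placeholder objects are assumptions made for this example, and node identity is modelled with Python's id().

```python
def kayessian_intersection(ns1, ns2):
    """Keep the members of ns1 that also belong to ns2 (order of ns1 preserved).

    Mirrors $ns1[count(.|$ns2) = count($ns2)]: uniting a node with ns2
    leaves the size unchanged exactly when the node is already in ns2.
    """
    ns2_ids = {id(n) for n in ns2}  # node identity, not value equality
    return [n for n in ns1 if len(ns2_ids | {id(n)}) == len(ns2_ids)]


# Four stand-in "nodes" in document order: a, b, c, d.
a, b, c, d = (object() for _ in range(4))
following_start = [b, c, d]  # plays the role of START/following-sibling::node()
preceding_end = [a, b, c]    # plays the role of END/preceding-sibling::node()

between = kayessian_intersection(following_start, preceding_end)
```

Here between holds b and c, the only nodes that appear in both lists. An empty second list yields an empty result, because count(.|$ns2) is then 1 while count($ns2) is 0.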

Verification with XSLT:

This XSLT 1.0 transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "    //a[starts-with(@href,'file://START')]/following-sibling::node()
         [count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
         =
          count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
         ]
  "/>
 </xsl:template>
</xsl:stylesheet>

when applied to the provided XML document:

<html>...
    <a href="file://START0"></a>
    <p>First nodes</p>
    <p>First nodes    
        <span>X</span>
    </p>
    <p>First nodes</p>
    <a href="file://END0"></a>...
</html>

produces the wanted, correct result:

<p>First nodes</p>
<p>First nodes    
        <span>X</span>
</p>
<p>First nodes</p>

This XSLT 2.0 transformation:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  " //a[starts-with(@href,'file://START')]/following-sibling::node()
   intersect
    //a[starts-with(@href,'file://END')]/preceding-sibling::node()
  "/>
 </xsl:template>
</xsl:stylesheet>

when applied to the same XML document (above), again produces exactly the wanted result.
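
If no XSLT processor is at hand, the expected result can be roughly cross-checked with a short Python sketch. Note the swap in technique: the standard library's xml.etree.ElementTree does not support starts-with() or the sibling axes, so this walks each parent's children directly instead of evaluating the XPath. It also sees only element nodes (ElementTree has no separate text nodes), and the helper names is_anchor and elements_between are invented for this illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<html>...
    <a href="file://START0"></a>
    <p>First nodes</p>
    <p>First nodes
        <span>X</span>
    </p>
    <p>First nodes</p>
    <a href="file://END0"></a>...
</html>"""


def is_anchor(el, prefix):
    """True for an <a> element whose href starts with the given prefix."""
    return el.tag == "a" and el.get("href", "").startswith(prefix)


def elements_between(root, start_prefix, end_prefix):
    """Sibling elements after a START anchor and before an END anchor."""
    selected = []
    for parent in root.iter():
        children = list(parent)
        starts = [i for i, c in enumerate(children) if is_anchor(c, start_prefix)]
        ends = [i for i, c in enumerate(children) if is_anchor(c, end_prefix)]
        if starts and ends:
            # From just after the first START up to (not including) the last END.
            selected.extend(children[starts[0] + 1 : ends[-1]])
    return selected


root = ET.fromstring(SAMPLE)
between = elements_between(root, "file://START", "file://END")
print([e.tag for e in between])  # ['p', 'p', 'p']
```

The inner span is not listed on its own, but it travels with its parent p, just as in the xsl:copy-of output.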

Answer 2

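“I added a special case that I should handle”

To handle this special case you can work in the same way, that is, using the Kayessian formula (and XPath Visualizer ;-)). The intersecting node-sets change as follows:

Intersecting node-set C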

    "//p[.//a[starts-with(@href,'file://START')]]
         /following-sibling::node()"

all the following siblings of the p that contains the START a.

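Intersecting node-set D

"./following-sibling::p[.//a[starts-with(@href,'file://END')]] /preceding-sibling::node()"

all the preceding siblings of the p that contains the END a and follows the current node.

Now you can perform the intersection:

C ∩ D

that is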

"//p[.//a[starts-with(@href,'file://START')]] /following-sibling::node()[ count(.| ./following-sibling::p [.//a[starts-with(@href,'file://END')]] /preceding-sibling::node()) = count(./following-sibling::p [.//a[starts-with(@href,'file://END')]] /preceding-sibling::node()) ]"

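If you need to handle both cases, you can go on to combine the intersections as

(A ∩ B) ∪ (C ∩ D)

where:

the union ∪ must be performed with the XPath union operator |;

node-sets A and B are those already shown in @Dimitre's answer;

node-sets C and D are those shown in my answer.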
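
The combined selection can be sketched in Python as well. As before, this is a plain sibling walk with xml.etree.ElementTree rather than an XPath evaluation, and it handles element nodes only. The sample document (including its div wrappers) and all helper names are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET


def href_starts(el, prefix):
    """Bare marker: an <a> whose href starts with prefix."""
    return el.tag == "a" and el.get("href", "").startswith(prefix)


def p_wraps(el, prefix):
    """Wrapped marker: a <p> with a descendant <a> whose href starts with prefix."""
    return el.tag == "p" and any(href_starts(a, prefix) for a in el.iter("a"))


def between(root, is_marker, start_prefix, end_prefix):
    """Sibling elements lying between a START marker and an END marker."""
    out = []
    for parent in root.iter():
        kids = list(parent)
        starts = [i for i, k in enumerate(kids) if is_marker(k, start_prefix)]
        ends = [i for i, k in enumerate(kids) if is_marker(k, end_prefix)]
        if starts and ends:
            out.extend(kids[starts[0] + 1 : ends[-1]])
    return out


def between_either(root, start_prefix, end_prefix):
    a_and_b = between(root, href_starts, start_prefix, end_prefix)  # A ∩ B
    c_and_d = between(root, p_wraps, start_prefix, end_prefix)      # C ∩ D
    picked = {id(n) for n in a_and_b} | {id(n) for n in c_and_d}    # union
    # Re-walk the tree so the union is in document order without duplicates.
    return [n for n in root.iter() if id(n) in picked]


DOC = ET.fromstring("""<html>
  <div>
    <a href="file://START0"></a>
    <p>bare</p>
    <a href="file://END0"></a>
  </div>
  <div>
    <p><a href="file://START1"></a></p>
    <p>wrapped</p>
    <p><a href="file://END1"></a></p>
  </div>
</html>""")

result = between_either(DOC, "file://START", "file://END")
print([n.text for n in result])  # ['bare', 'wrapped']
```

Computing the two intersections separately and then uniting them follows the (A ∩ B) ∪ (C ∩ D) shape literally, so a bare START paired with a wrapped END is not matched, the same as with the XPath union.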

XPath expression: selecting elements between A HREF="expr" tags
