Question

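I'm using .NET 3.5 (C#) and the HTML Agility Pack to do some web scraping. Some of the fields I need to extract are structured as paragraphs in which the components are separated by line-break tags. I'd like to be able to select the individual components from between the line breaks. Each component may be made up of several nodes (that is, it may be more than a single string). For example: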

<h3>Section title</h3>
<p>
  <b>Component A</b><br />
  Component B <i>includes</i> <strong>multiple elements</strong><br />
  Component C
</p>

I'd like to select

<b>Component A</b>

then:

Component B <i>includes</i> <strong>multiple elements</strong>

then:

Component C

There could also be more (<br />-separated) components.

I can easily get the first component with:

p/br[1]/preceding-sibling::node()

I can also easily get the last component with:

p/br[2]/following-sibling::node()

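But I can't work out how to extract the node(s) that lie between two other tags (that is, the nodes that are siblings of both, preceding node X and following node Y).

The alternative is to scan through the elements manually. If that's the simplest way then that's what I'll do, but XPath has been impressing me so far, so I'm hoping there's a way to do this.

Edit

Since I need to cope with cases where there are more than 3 components, it seems the answer will require multiple XPath calls at the very least, so I'm going to proceed with a solution based on that (it's the answer I've accepted). AakashM's answer also helped my understanding of XPath, so I've upvoted it.

Thanks for all your help, everyone! I hope to be able to return the favor someday.

Edit 2

The new answer from Dimitre Novatchev does work correctly, with some tweaking.

Solution: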

int i = 0;
do
{
    yield return para.SelectNodes(String.Format(
        "node()[not(self::br) and count(preceding-sibling::br) = {0}]", i));
    ++i;
} while (para.SelectSingleNode(String.Format("br[{0}]", i)) != null);

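I should note that this loop is a little inefficient, because it repeats an XPath query on every pass to find out whether any br tags remain. In my case the inefficiency isn't a problem, but be aware of it if you want to use this answer in other circumstances (then again, if you do want to do this in a performance-sensitive situation, you should probably scan manually rather than use XPath).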
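
For comparison, the manual scan mentioned above could look something like the following. This is only a sketch: the class name ManualSplit is made up for illustration, and it works on System.Xml nodes as in the test code below rather than on HTML Agility Pack nodes, so it would need adapting for HtmlNode. It makes a single pass over the children and issues no XPath queries.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper (not from the original answers): splits the children
// of a paragraph node into groups separated by <br> elements, in one pass.
static class ManualSplit
{
    public static List<List<XmlNode>> SplitOnLineBreak(XmlNode para)
    {
        var groups = new List<List<XmlNode>>();
        var current = new List<XmlNode>();

        foreach (XmlNode child in para.ChildNodes)
        {
            if (child.NodeType == XmlNodeType.Element && child.LocalName == "br")
            {
                // A <br> closes the current group and starts a new one.
                groups.Add(current);
                current = new List<XmlNode>();
            }
            else
            {
                current.Add(child);
            }
        }

        // Whatever follows the last <br> (or everything, if there is none).
        groups.Add(current);
        return groups;
    }
}
```

Like the XPath loop, this returns one group more than there are br elements, so a paragraph with no br at all yields a single group.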

Full test code (a modified version of the test code that AakashM helpfully included):

using System;
using System.Collections.Generic;
using System.Xml;

namespace TestXPath
{
    class Program
    {
        static void Main(string[] args)
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(@"
<x>
 <h3>Section title</h3>
 <p>
  <b>Component A</b><br />
  Component B <i>includes</i> multiple <strong>elements</strong><br />
  Component C
 </p>
</x>
            ");


            foreach (var nodes in SplitOnLineBreak(doc.SelectSingleNode("x/p")))
            {
                Dump(nodes);
                Console.WriteLine();
            }

            Console.ReadLine();
        }

        private static IEnumerable<XmlNodeList> SplitOnLineBreak(XmlNode para)
        {
            int i = 0;
            do
            {
                yield return para.SelectNodes(String.Format(
                    "node()[not(self::br) and count(preceding-sibling::br) = {0}]", i));
                ++i;
            } while (para.SelectSingleNode(String.Format("br[{0}]", i)) != null);
        }

        private static void Dump(XmlNodeList nodes)
        {
            foreach (XmlNode node in nodes)
            {
                Console.WriteLine(string.Format("-->{0}<---", 
                                  node.OuterXml));                    
            }
        }
    }
}

Answer 1

If in your case you will always have three "pieces", separated by brs, you can use this XPath to get the middle "piece":

//node()[preceding::br and following::br]

This uses the preceding and following axes to return all nodes between the two brs.

Edit: here is my test app (please excuse the XmlDocument, I'm still working with .NET 2.0...)

using System;
using System.Xml;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(@"
<x>
 <h3>Section title</h3>
 <p>
  <b>Component A</b><br />
  Component B <i>includes</i> <strong>multiple elements</strong><br />
  Component C
 </p>
</x>
            ");

            XmlNodeList nodes = doc.SelectNodes(
                "//node()[preceding::br and following::br]");

            Dump(nodes);

            Console.ReadLine();
        }

        private static void Dump(XmlNodeList nodes)
        {
            foreach (XmlNode node in nodes)
            {
                Console.WriteLine(string.Format("-->{0}<---", 
                                  node.OuterXml));                    
            }
        }
    }
}

And here's the output:

-->
      Component B <---
--><i>includes</i><---
-->includes<---
--><strong>multiple elements</strong><---
-->multiple elements<---

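As you can see, you get an XmlNodeList of everything between the brs.

The way I think about it is: this XPath returns any node, anywhere, as long as the preceding axis for that node contains a br and the following axis contains a br.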
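
Because the preceding and following axes cover the whole document, the output above also includes descendants such as the text inside the i and strong elements. If only the direct children of the paragraph are wanted, one possible variation is to use the sibling axes instead. The sketch below assumes the same x/p test document and exactly two brs; the class and method names are invented for illustration.

```csharp
using System;
using System.Xml;

// Hypothetical variation on the query above: restricts the match to direct
// children of <p> by using the sibling axes instead of preceding/following.
static class MiddlePiece
{
    public static XmlNodeList Select(XmlNode root)
    {
        // Children of p that are not br themselves and have a br sibling
        // both before and after them.
        return root.SelectNodes(
            "p/node()[not(self::br) and preceding-sibling::br and following-sibling::br]");
    }
}
```

Called as MiddlePiece.Select(doc.DocumentElement), this should give only the top-level nodes of the middle piece. With more than two brs it returns everything between the first and the last br, so it does not generalize beyond three pieces.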

Answer 2

How about:

p/*[not(local-name()='br')]

and then index that expression for whichever term you want.

Edit:

For your indexing problem:

p/*[not(local-name()='br') and position() < x and position() > y]

Answer 3

Try using the position() or count() functions. This is a guess that may help you arrive at the right syntax.

p/*[position() > position(/p/br[1]) and position() < position(/p/br[2])]

Edit: please read the comments before voting or commenting.

Answer 4

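This can be done easily with XPath 2.0, or with XPath 1.0 hosted in XSLT.

With XPath 1.0 hosted in .NET, it can be done in the following steps:

1. Make the appropriate "p" node the current node.
2. Find the number of <br /> children of the current "p" node:

count(br)

3. If N is the count determined in step 2, then for each $k from 0 to N:

3.1 Find all nodes that are preceded by exactly $k <br /> elements:

node()[not(self::br) and count(preceding::br) = $k]

3.2 For each node found, get its string value.

3.3 Concatenate all the string values obtained in step 3.2. The result of this concatenation is all the text contained in the given paragraph (that is, in the $k-th <br />-separated piece).

Note: the expression in step 3.1 has to be built dynamically in order to substitute the value of $k.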
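
The steps above can be sketched in C# roughly as follows. This is an illustration rather than the answerer's own code: the names BrPieces and GetPieceTexts are invented, and the expression uses preceding-sibling instead of preceding (the kind of tweak the question's second edit applies) so that br elements elsewhere in the document are not counted. count(br) is evaluated once up front, and the step 3.1 expression is built with String.Format for each $k.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;
using System.Xml.XPath;

// Hypothetical helper following steps 1 to 3.3 for a given "p" node.
static class BrPieces
{
    public static List<string> GetPieceTexts(XmlNode para)
    {
        // Step 2: count(br) is a number in XPath 1.0, returned as a double.
        XPathNavigator nav = para.CreateNavigator();
        int n = Convert.ToInt32((double)nav.Evaluate("count(br)"));

        var texts = new List<string>();
        for (int k = 0; k <= n; k++)
        {
            // Step 3.1: build the expression dynamically for this k.
            string expr = String.Format(
                "node()[not(self::br) and count(preceding-sibling::br) = {0}]", k);

            // Steps 3.2 and 3.3: concatenate the string values of the nodes.
            var sb = new StringBuilder();
            foreach (XmlNode node in para.SelectNodes(expr))
            {
                sb.Append(node.InnerText);
            }
            texts.Add(sb.ToString());
        }
        return texts;
    }
}
```

Evaluating count(br) a single time avoids the repeated br[i] existence check, at the cost of one extra query before the loop.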

Using XPath to select between two tags (siblings) (in .NET)
