HtmlAgilityPack: how to extract the HTML between certain tags

Time: 2016-05-19 10:32:36

Tags: c# html-agility-pack

I need to extract all the paragraphs from an HTML string, together with all the text that sits between those tags.

The code below does not work when the text parsed into the HtmlDocument differs from the original text. In the sample,

some <br />text

is changed to

some <br>text

Example:

string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var nodes = doc.DocumentNode.SelectNodes("//p");
int lastPos = -1; // end offset of the previous <p>, or -1 before the first one
foreach (HtmlAgilityPack.HtmlNode n in nodes)
{
    if (lastPos > -1)
    {
        // Slice the serialized document between the previous <p> and this one.
        // StreamPosition is an offset into the original text, so the slice drifts
        // whenever the serialized HTML differs from the source.
        string textNotInP = doc.DocumentNode.OuterHtml.Substring(lastPos, n.StreamPosition - lastPos);
        System.Diagnostics.Debug.WriteLine(textNotInP);
    }
    System.Diagnostics.Debug.WriteLine(n.OuterHtml);
    lastPos = n.StreamPosition + n.OuterHtml.Length;
}

The correct result would be:

<p>firt paragraph</p>
some <br>text
<p>another paragraph</p>
<span>some text between span</span>
<p>hellow word</p>

But the code above returns:

<p>firt paragraph</p>
some <br>text<p
<p>another paragraph</p>
pan>some text between span</span><p
<p>hellow word</p>

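The reason is that StreamPosition returns the node's position relative to the original text, not relative to the HTML as parsed in the HtmlDocument.

Is there a way to get a node's position relative to the parsed HTML?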

1 answer:

Answer 0: (score: 0)

You can use the OuterHtml property of each <p> element to get the HTML you need:

string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var nodes = doc.DocumentNode.SelectNodes("//p");
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}

Output:

<p>firt paragraph</p>
<p>another paragraph</p>
<p>hellow word</p>

Alternatively, if you want everything between the first <p> and the last <p> element, you can use the following XPath:

var query = "//node()[preceding-sibling::p or self::p][following-sibling::p or self::p]";

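The XPath selects every node (element or text node) that has a preceding sibling p or is itself a p element, and that also has a following sibling p or is itself a p element.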

var nodes = doc.DocumentNode.SelectNodes(query);
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}

Output:

<p>firt paragraph</p>
some
<br />
text
<p>another paragraph</p>
<span>some text between span</span>
<p>hellow word</p>
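
The XPath output above prints each text node and the <br /> on separate lines, whereas the question wanted everything between two paragraphs as one piece. A minimal sketch of one way to group it, without using StreamPosition at all, is shown below. It assumes the <p> elements are direct children of the document node, as in the sample; ParagraphSplitter and SplitAroundParagraphs are hypothetical names for illustration, not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

static class ParagraphSplitter
{
    // Hypothetical helper: returns each top-level <p> as one entry and the
    // markup found between two consecutive <p> elements as one entry.
    // Assumption: the paragraphs are direct children of the document node.
    public static List<string> SplitAroundParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        var between = new StringBuilder();
        bool seenParagraph = false;

        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            if (node.Name == "p")
            {
                // Flush whatever was collected since the previous <p>.
                if (seenParagraph && between.Length > 0)
                    result.Add(between.ToString());
                between.Clear();

                result.Add(node.OuterHtml);
                seenParagraph = true;
            }
            else if (seenParagraph)
            {
                // Text nodes and inline elements between paragraphs are
                // concatenated so they come out as a single piece.
                between.Append(node.OuterHtml);
            }
        }

        // Anything after the last <p> is never flushed, so it is left out,
        // matching the "between first and last <p>" behaviour of the XPath.
        return result;
    }
}

static class Program
{
    static void Main()
    {
        string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
        foreach (string part in ParagraphSplitter.SplitAroundParagraphs(s))
            Console.WriteLine(part);
    }
}
```

Because each piece is built from the nodes' own OuterHtml, the offsets of the original text never enter the calculation, which is what broke the Substring approach in the question. Note that the <br /> is likely to be printed as written in the source rather than as <br>.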