Question

I've done my best to add comments throughout the code, but I'm stuck on a few parts.

using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack;

// create a new instance of the HtmlDocument class called doc
HtmlDocument doc = new HtmlDocument();

// the Load method is called here to load the variable result, which is HTML
// formatted into a string in a previous code snippet
doc.Load(new StringReader(result));

// a new variable called root with datatype HtmlNode is created here.
// I'm not sure what doc.DocumentNode refers to?
HtmlNode root = doc.DocumentNode;

// a list is getting constructed here. I haven't had much experience
// with constructing lists yet
List<string> anchorTags = new List<string>();

// a foreach loop is used to loop through the html document to
// extract html with 'a' attributes, I think
foreach (HtmlNode link in root.SelectNodes("//a"))
{
    // I don't really know what's going on here
    string att = link.OuterHtml;
    // I don't really know what's going on here either
    anchorTags.Add(att);
}

I took this code sample from here. Thanks to Farooq Kaiser.

Answer 1

The key is the SelectNodes method. That call uses XPath to get a list of the nodes in the HTML that match your query.

This is where I learned XPath: http://www.w3schools.com/xpath/default.asp

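The loop then just walks through those matches, takes each node's OuterHtml (the full HTML including the tag itself), and adds it to a list of strings. A List is basically just an array, but more flexible. If you only want the content and not the enclosing tag, you can use HtmlNode.InnerHtml or HtmlNode.InnerText instead.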
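To make the difference between those three properties concrete, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the sample markup and the names NodeTextDemo and Describe are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class NodeTextDemo
{
    // Hypothetical sample markup, not taken from the question.
    public const string Sample =
        "<p>See <a href=\"https://example.com\">the <b>docs</b></a>.</p>";

    // Returns OuterHtml, InnerHtml and InnerText of the first <a> element,
    // or an empty list if the markup contains no <a> element.
    public static List<string> Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse directly from a string

        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a");

        // A List<string> grows as items are added; no size is fixed up front.
        var parts = new List<string>();
        if (link != null)
        {
            parts.Add(link.OuterHtml); // the element including its own tags
            parts.Add(link.InnerHtml); // only the markup between the tags
            parts.Add(link.InnerText); // text only, nested tags stripped
        }
        return parts;
    }

    public static void Main()
    {
        foreach (string part in Describe(Sample))
        {
            Console.WriteLine(part);
        }
    }
}
```

For this sample the three entries should be the whole anchor element, then `the <b>docs</b>`, then `the docs`.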

Answer 2

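In HTML Agility Pack terms, "//a" means "find all tags named 'a' or 'A' anywhere in the document". For more general help with XPath, see the XPath documentation (which is independent of the HTML Agility Pack). So, if your document looks like this: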

<div>
  <A href="xxx">anchor 1</a>
  <table ...>
    <a href="zzz">anchor 2</A>
  </table>
</div>

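You will get two anchor HTML elements. OuterHtml is the node's HTML including the node itself, while InnerHtml is only the HTML content inside the node. So here are the two OuterHtml values: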

  <A href="xxx">anchor 1</a>

and

<a href="zzz">anchor 2</A>

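Note that I wrote 'a' or 'A' because the HAP implementation takes care of HTML's case insensitivity. The XPath expression itself is another matter: "//A" does not work by default, so you need to specify tag names in lowercase.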
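The case behaviour described here can be checked with a short sketch. It assumes the HtmlAgilityPack package with its default settings, under which SelectNodes returns null (not an empty collection) when nothing matches. CaseDemo and CountMatches are illustrative names only.

```csharp
using System;
using HtmlAgilityPack;

public static class CaseDemo
{
    // Same shape as the sample document in this answer, with mixed-case anchor tags.
    public const string Sample =
        "<div><A href=\"xxx\">anchor 1</a>" +
        "<table><a href=\"zzz\">anchor 2</A></table></div>";

    // Counts the nodes matching an XPath expression, treating "no match" as 0.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count; // null means nothing matched
    }

    public static void Main()
    {
        Console.WriteLine(CountMatches(Sample, "//a")); // lowercase query: both anchors
        Console.WriteLine(CountMatches(Sample, "//A")); // uppercase query: no match expected
    }
}
```

The null check matters for the loop in the question as well: if a page contains no anchors at all, iterating directly over the result of root.SelectNodes("//a") would throw a NullReferenceException, so guarding the result before the foreach is a reasonable precaution.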

Can someone explain this HtmlAgilityPack code?
