Question

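I am trying to parse an HTML document to retrieve a specific link from the page. I know this may not be the best approach, but I am trying to find the HTML node I need by its inner text. However, that text occurs twice in the HTML: once in the footer and once in the navigation bar. I need the link from the navigation bar. The footer occurrence comes first in the HTML. Here is my code: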

    public string findCollegeURL(string catalog, string college)
    {
        //Find college
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(catalog);
        var root = doc.DocumentNode;
        var htmlNodes = root.DescendantsAndSelf();

        // Search through fetched html nodes for relevant information
        int counter = 0;
        foreach (HtmlNode node in htmlNodes) {
            string linkName = node.InnerText;
            if (linkName == colleges[college] && counter == 0)
            {
                counter++;
                continue;
            }  
            else if(linkName == colleges[college] && counter == 1)
            {
                string targetURL = node.Attributes["href"].Value; //"found it!"; //
                return targetURL;
            }/* */
        }

        return "DID NOT WORK";
    }

The program enters the else-if branch, but I get a NullReferenceException when it tries to retrieve the link. Why? How can I retrieve the link I need?

Here is the relevant markup from the HTML document I am trying to access:

    <tr class>
       <td id="acalog-navigation">
           <div class="n2_links" id="gateway-nav-current">...</div>
           <div class="n2_links">...</div>
           <div class="n2_links">...</div>
           <div class="n2_links">...</div>
           <div class="n2_links">...</div>
              <a href="/content.php?catoid=10&navoid=1210" class="navbar" tabindex="119">College of Science</a>
           </div>

This is the link I want: /content.php?catoid=10&navoid=1210

Answer 1

I find XPath easier to use than writing a lot of code:

    var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
                  .Attributes["href"].Value;

If there are two links with the same text, select the second one:

    var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
                  .Attributes["href"].Value;
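
Picking the second match depends on the footer always coming first. As a minimal sketch, the XPath could instead be scoped to the navigation cell shown in the question. This assumes the cell keeps `id="acalog-navigation"` and that the link text contains no single quote. The class and method names are hypothetical, not part of HTML Agility Pack.

```csharp
using HtmlAgilityPack;

public static class NavLinkXPath
{
    // Hypothetical helper: returns the href of the matching link inside the
    // navigation cell, or null when no such link (or no href) is found.
    public static string FindNavHref(string html, string linkText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: the navigation bar is the <td> with id="acalog-navigation".
        // linkText is embedded in the expression, so it must not contain a single quote.
        var anchor = doc.DocumentNode.SelectSingleNode(
            "//td[@id='acalog-navigation']//a[normalize-space(.)='" + linkText + "']");

        // SelectSingleNode returns null when nothing matches, so guard before reading href.
        return anchor?.Attributes["href"]?.Value;
    }
}
```

Calling `NavLinkXPath.FindNavHref(catalog, "College of Science")` would then return the navigation link regardless of where the footer sits in the document.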

LINQ version:

    var links = doc.DocumentNode.Descendants("a")
                   .Where(a => a.InnerText == "College of Science")
                   .Select(a => a.Attributes["href"].Value)
                   .ToList();
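
A likely cause of the NullReferenceException in the question is that `DescendantsAndSelf()` also yields text nodes. The footer's `<a>` element matches first, and the second match is then its child text node, which has no `href` attribute, so `node.Attributes["href"]` is null. The sketch below keeps the "take the second occurrence" idea from the question but looks only at `<a>` elements. The class and method names are hypothetical.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class SecondLinkFinder
{
    // Hypothetical helper: returns the href of the second <a> whose text equals
    // linkText, or null when there are fewer than two matches or no href.
    public static string FindSecondHref(string html, string linkText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .Descendants("a")                            // <a> elements only; text nodes are never returned
                  .Where(a => a.InnerText.Trim() == linkText)  // compare the visible link text
                  .Skip(1)                                     // assumption: the first match is the footer link
                  .Select(a => a.Attributes["href"]?.Value)    // null-safe read of href
                  .FirstOrDefault();
    }
}
```

Returning null instead of throwing lets the caller decide how to handle a page where the link is missing.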

Find a specific link in an HTML document in C# using HTML Agility Pack
