Question

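I have a string variable that contains the entire HTML of a web page. The page will contain links to other websites. I want to build a list of all the hrefs (like a web crawler would). What is the best way to do this? Would using any extension method help? What about using a regular expression?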

Thanks in advance

Answer 1

Use a DOM parser such as HTML Agility Pack to parse your document and find all the links.

There is a good question here on how to use HTML Agility Pack. Here is a simple example to get you started:

string html = "your HTML here";

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// Requires "using System.Linq;" for Where and Select.
// Keep only <a> elements that actually carry an href attribute.
var links = doc.DocumentNode.DescendantNodes()
   .Where(n => n.Name == "a" && n.Attributes.Contains("href"))
   .Select(n => n.Attributes["href"].Value);

Answer 2

I think you'll find that this answers your question:

http://msdn.microsoft.com/en-us/library/t9e807fx.aspx

:)

Answer 3

I would go with Regex.

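        // Matches href="..." or href='...' and captures the value in group 1.
        Regex exp = new Regex(
            @"href\s*=\s*[""']([^""']*)[""']",
            RegexOptions.IgnoreCase);
        string InputText = "your HTML here"; // supply the HTML source
        MatchCollection MatchList = exp.Matches(InputText);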

Answer 4

Try this regular expression (it should work):

var matches = Regex.Matches (html, @"href=""(.+?)""");

You can loop over the matches and extract the captured URLs.
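Since the question also asks about extension methods, here is a minimal sketch of that loop wrapped in one. The class name HrefExtensions and the method name ExtractHrefs are made up for illustration; only Regex.Matches and Match.Groups come from the framework. It reuses the pattern from this answer as-is, so it assumes double-quoted attribute values and lowercase href, and it will also pick up href attributes on tags other than a (for example link).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtensions
{
    // Hypothetical helper: returns the value of every double-quoted href in the HTML.
    public static List<string> ExtractHrefs(this string html)
    {
        var urls = new List<string>();

        foreach (Match match in Regex.Matches(html, @"href=""(.+?)"""))
        {
            // Groups[0] is the whole match; Groups[1] is the first capture,
            // i.e. the text between the double quotes.
            urls.Add(match.Groups[1].Value);
        }

        return urls;
    }
}
```

With that in place, something like html.ExtractHrefs() on the string from the question should give you the list. For anything beyond quick scraping, a real parser such as the HTML Agility Pack suggested in the other answers is likely to be more robust than a regex.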

Answer 5

Have you considered using HTML Agility Pack? http://htmlagilitypack.codeplex.com/

With it, you can simply use XPath to get all the links on the page and put them into a list.

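private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();

    // SelectNodes returns null (not an empty collection) when nothing matches.
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links == null)
    {
        return hrefTags;
    }

    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}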

Taken from another post here: Get all links on html page?

Search for a string within a string (search for all hrefs in HTML source)
