Question

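I am working on a method that accepts a string (HTML code) and returns an array containing all the links it finds.

I have looked at a few options such as the HTML Agility Pack, but it seems more complicated than this project calls for.

I am also interested in using a regular expression, because I don't have much experience with regex in general and I think this would be a good learning opportunity.

My code so far is: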

    WebClient client = new WebClient();
    string htmlCode = client.DownloadString(p);
    Regex exp = new Regex(@"http://(www\.)?([^\.]+)\.com", RegexOptions.IgnoreCase);
    string[] test = exp.Split(htmlCode);

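However, I am not getting the results I want, since I am still getting to grips with regular expressions.

The pseudocode for what I am looking for is "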

Answer 1

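If you are looking for a simple solution, regular expressions are not your answer. They are fundamentally limited and, given the complexity of the HTML language, cannot be used to reliably parse links or other tags out of HTML files.

Long-winded version: http://blogs.msdn.com/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx

Instead, you will need to use an actual HTML DOM API to parse the links.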

Answer 2

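Regular expressions are not the best choice for HTML.

See these previous questions:

When is it wise to use regular expressions with HTML?
Regex that matches all the text content of an HTML input

Instead, you want something that already knows how to parse the DOM; otherwise, you are reinventing the wheel.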

Answer 3

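Other users may tell you "No, stop! Regex should not be mixed with HTML! It's like mixing bleach and ammonia!". There is a lot of wisdom in that advice, but it is not the whole story.

The truth is that a regex works just fine for collecting normally formatted links. A better approach, though, is to use a tool dedicated to this kind of job, such as HtmlAgilityPack.

If you use a regex, you will probably match 99.9% of links, but you may miss rare, unexpected edge cases or malformed HTML.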
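To make the "rare edge cases" concrete, here is a minimal sketch of how a loose href regex can both miss and over-match. The class name `RegexLimits`, the method `CountQuotedHrefs`, and the example.com URLs are made up for illustration; only `System.Text.RegularExpressions` is used.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexLimits
{
    // Counts href attributes whose value is wrapped in single or double quotes.
    // Hypothetical helper for illustration only; this is not an HTML parser.
    public static int CountQuotedHrefs(string html)
    {
        return Regex.Matches(
            html,
            @"href\s*=\s*[""'][^""']+[""']",
            RegexOptions.IgnoreCase).Count;
    }
}
```

With this pattern, a normal `<a href="http://example.com/a">` is counted, an unquoted `<a href=http://example.com/c>` (which browsers accept) is missed, and a link inside an HTML comment is still counted even though it is not part of the rendered page.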

Here is a function I put together that uses HtmlAgilityPack to do what you are asking:

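    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    private static IEnumerable<string> DocumentLinks(string sourceHtml)
    {
        HtmlDocument sourceDocument = new HtmlDocument();

        sourceDocument.LoadHtml(sourceHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = sourceDocument.DocumentNode
            .SelectNodes("//a[@href!='#']");

        if (anchors == null)
            return Enumerable.Empty<string>();

        return anchors.Select(n => n.GetAttributeValue("href", ""));
    }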

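This function creates a new HtmlAgilityPack.HtmlDocument, loads the string containing the HTML into it, and then uses the XPath query "//a[@href!='#']" to select all the links on the page that do not point to "#". I then use the LINQ extension method Select to convert the HtmlNodeCollection into a sequence of strings containing the value of each href attribute, that is, where the link points.

Here is an example of its use: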

        List<string> links = 
            DocumentLinks((new WebClient())
                .DownloadString("http://google.com")).ToList();

        Debugger.Break();

This is likely to be more reliable than a regex.

Answer 4

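You can look for anything that resembles a URL with an http or https scheme. This is not HTML-proof, but it will get you things that look like http URLs, which I suspect is what you need. You can add more schemes and domains. The regex below looks for things that look like URLs "in" href attributes (not strictly).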

using System;
using System.Text.RegularExpressions;

class Program {
    static void Main(string[] args) {
        const string pattern = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
        var regex = new Regex(pattern);
        var urls = new string[] { 
            "href='http://company.com'",
            "href=\"https://company.com\"",
            "href='http://company.org'",
            "href='http://company.org/'",
            "href='http://company.org/path'",
        };

        foreach (var url in urls) {
            Match match = regex.Match(url);
            if (match.Success) {
                Console.WriteLine("{0} -> {1}", url, match.Groups["url"].Value);
            }
        }
    }
}

Output:

href='http://company.com' -> http://company.com
href="https://company.com" -> https://company.com
href='http://company.org' -> http://company.org
href='http://company.org/' -> http://company.org
href='http://company.org/path' -> http://company.org
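To connect this back to the question, which wanted an array of every link in an HTML string, here is a rough sketch that applies a similar pattern with `Regex.Matches` instead of `Regex.Split`. `Split` returns the text between matches, whereas `Matches` returns the matches themselves. The names `LinkExtractor` and `ExtractLinks` are hypothetical, and the pattern is deliberately loose (quoted, absolute http/https links only), so treat it as a learning aid under those assumptions rather than a robust parser.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LinkExtractor
{
    // Loose pattern: a quoted href value that starts with http:// or https://.
    // Unquoted attributes and relative links are intentionally not handled.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*[""'](?<url>https?://[^""']+)[""']",
        RegexOptions.IgnoreCase);

    public static string[] ExtractLinks(string html)
    {
        // Matches yields every occurrence; the named group holds the URL itself.
        return HrefPattern.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups["url"].Value)
            .ToArray();
    }
}
```

In the question's code, this would be called as `string[] links = LinkExtractor.ExtractLinks(htmlCode);` after the `DownloadString` call.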

Regex to parse links from HTML code

4 answers: