Question

I have some text that may contain links like this:

<a rel="nofollow" target="_blank" href="http://loremipsum.net/">http://loremipsum.net/</a>
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, <a rel="nofollow" target="_blank" href="http://loremipsum.net/">http://loremipsum.net/</a> sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

I want to find the links (a tags) in this text. What regular expression would do that?

This pattern does not work:

const string UrlPattern = @"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?";
var urlMatches = Regex.Matches(text, UrlPattern);

Thanks

Answer 1

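I suggest using HtmlAgilityPack to parse the HTML (available from NuGet):

// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Select every <a> element that has an href attribute.
// Note: SelectNodes returns null when nothing matches.
var links = doc.DocumentNode.SelectNodes("//a[@href]")
               .Select(a => a.Attributes["href"].Value);

Result: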

[
  "http://loremipsum.net/",
  "http://loremipsum.net/"
]

Recommended reading: Parsing Html The Cthulhu Way

Answer 2

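Maybe something like this:

// Group 1 captures the opening quote; \1 requires the same quote to close the value.
Regex regexObj = new Regex(@"<a.+?href=(['""])(.+?)\1");
string resultString = regexObj.Match(subjectString).Groups[2].Value;

To collect all matches in a list:

StringCollection resultList = new StringCollection();

Regex regexObj = new Regex(@"<a.+?href=(['""])(.+?)\1");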
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
    resultList.Add(matchResult.Groups[2].Value);
    matchResult = matchResult.NextMatch();
}
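
As a rough sketch of the same idea in a reusable form, the loop can also be written with Regex.Matches. The class name HrefFinder and the method name FindAll are made up for illustration, and the pattern uses the quote class ['""] so that either a single or a double quote opens the value and the backreference \1 requires the same character to close it. This is an assumption-laden sketch rather than a robust HTML parser.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class HrefFinder
{
    // Group 1 captures the opening quote (' or ");
    // \1 then requires that same quote character to end the href value.
    private static readonly Regex HrefPattern =
        new Regex(@"<a.+?href=(['""])(.+?)\1", RegexOptions.IgnoreCase);

    // Returns the href value of every anchor the pattern finds.
    public static List<string> FindAll(string subjectString)
    {
        var resultList = new List<string>();
        foreach (Match match in HrefPattern.Matches(subjectString))
        {
            resultList.Add(match.Groups[2].Value);
        }
        return resultList;
    }
}
```

Because the closing quote must equal the opening one, a value such as href='a"b' is captured whole instead of being cut off at the inner double quote.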

Answer 3

You should use an XML parser; it is more robust and reliable for this kind of task. But if you want something quick and dirty, here it is:

<a.*?<\/a>

If that is too simple and you need to capture the link address or the link content, use the following:

<a.*?href="(?<address>.*?)".*?>(?<content>.*?)<\/a>

Neither of them correctly matches nested tags.
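
For what it is worth, here is a minimal sketch of how the named-group pattern might be used from C#. The class name LinkExtractor and the method name Extract are hypothetical, and the choice of RegexOptions.IgnoreCase and RegexOptions.Singleline is an assumption on my part (so that uppercase tags and anchors spanning line breaks are also picked up); the pattern itself is the one given above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class LinkExtractor
{
    // Same pattern as above, written as a C# verbatim string ("" is an escaped quote).
    // Singleline lets '.' match newlines; IgnoreCase also accepts <A HREF="...">.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a.*?href=""(?<address>.*?)"".*?>(?<content>.*?)<\/a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one (address, content) pair per anchor the pattern finds.
    public static List<(string Address, string Content)> Extract(string text)
    {
        var results = new List<(string Address, string Content)>();
        foreach (Match m in AnchorPattern.Matches(text))
        {
            // Named groups are read by name rather than by index.
            results.Add((m.Groups["address"].Value, m.Groups["content"].Value));
        }
        return results;
    }
}
```

Calling LinkExtractor.Extract(text) on the sample from the question should give one pair per anchor. The same limits apply as for the bare pattern: nested tags are not handled, only double-quoted href values are matched, and "<a" also matches tag names that merely begin with "a", such as <abbr>.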
