Question

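I have a string that contains HTML. Using C#, I want to get all the href values from the hyperlinks in it. Target string:
<a href="~/abc/cde" rel="new">Link1</a> <a href="~/abc/ghq">Link2</a>
I want to get the values "~/abc/cde" and "~/abc/ghq".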

Answer 1

Use the HTML Agility Pack to parse the HTML. Its examples page includes a sample that walks an HTML document and reads the href values:

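 // Assumes doc is an HtmlAgilityPack.HtmlDocument that has already been loaded.
 // Note: SelectNodes returns null when nothing matches, so check for null on arbitrary input.
 foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
 {
     HtmlAttribute att = link.Attributes["href"];

     // Do stuff with att.Value
 }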

Answer 2

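Using regular expressions to parse HTML is not advisable (think of text inside comments, and so on).

That said, the following regex should do the trick, and it also gives you the link's inner HTML in the linkHtml group if you need it: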

Regex regex = new Regex(@"\<a\s[^\<\>]*?href=(?<quote>['""])(?<href>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</a\s*\>).)*)\</a\s*\>", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture);
for (Match match = regex.Match(inputHtml); match.Success; match=match.NextMatch()) {
  Console.WriteLine(match.Groups["href"]);
}
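
To try the pattern against the sample string from the question, it can be wrapped in a small console program. This is only a sketch: the class name `HrefExtractor` and the helper `ExtractHrefs` are made up for illustration, and the regex is the one above, unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Same pattern as above; the named group "href" holds the attribute value.
    private static readonly Regex AnchorRegex = new Regex(
        @"\<a\s[^\<\>]*?href=(?<quote>['""])(?<href>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</a\s*\>).)*)\</a\s*\>",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Hypothetical helper: returns every quoted href value found in <a> tags.
    public static List<string> ExtractHrefs(string inputHtml)
    {
        var hrefs = new List<string>();
        foreach (Match match in AnchorRegex.Matches(inputHtml))
        {
            hrefs.Add(match.Groups["href"].Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        string html = "<a href=\"~/abc/cde\" rel=\"new\">Link1</a> <a href=\"~/abc/ghq\">Link2</a>";
        foreach (string href in ExtractHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Keep in mind that this pattern only sees quoted href values, and that `.` does not cross line breaks unless `RegexOptions.Singleline` is added, so an anchor whose content spans several lines would be missed.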

Answer 3

Here is a fragment of a regular expression (use it with the RegexOptions.IgnorePatternWhitespace option):

(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
# -- Extract Attributes Key Value Pairs  --

((?:\s+)             # One to many spaces start the attribute
 (?<Key>[^=]+)       # Name/key of the attribute
 (?:=)               # Equals sign needs to be matched, but not captured.

(?([\x22\x27])              # If quotes are found
  (?:[\x22\x27])
  (?<Value>[^\x22\x27]+)    # Place the value into named Capture
  (?:[\x22\x27])
 |                          # Else no quotes
   (?<Value>[^\s/>]*)       # Place the value into named Capture
 )
)+                  # -- One to many attributes found!

This gives you every tag; you can then filter for the tags you need and pick out the attribute you want.

I wrote more about this on my blog (C# Regex Linq: Extract an Html Node with Attributes of Varying Types).
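
The filtering step described above could look roughly like the following. It is a sketch rather than the blog's code: `TagAttributeScanner` and `GetAttributeValues` are hypothetical names, and it assumes that each repetition of the attribute group records exactly one `Key` capture and one `Value` capture, so the two capture lists can be walked in parallel.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagAttributeScanner
{
    // The pattern from this answer; the '#' comments require IgnorePatternWhitespace.
    private const string TagPattern = @"
(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
((?:\s+)                    # One to many spaces start the attribute
 (?<Key>[^=]+)              # Name/key of the attribute
 (?:=)                      # Equals sign is matched, but not captured.
 (?([\x22\x27])             # If quotes are found
   (?:[\x22\x27])
   (?<Value>[^\x22\x27]+)   # Place the value into named capture
   (?:[\x22\x27])
  |                         # Else no quotes
   (?<Value>[^\s/>]*)       # Place the value into named capture
 )
)+                          # One to many attributes found";

    // Hypothetical helper: values of one attribute across all tags with the given name.
    public static List<string> GetAttributeValues(string html, string tagName, string attributeName)
    {
        var values = new List<string>();
        foreach (Match match in Regex.Matches(html, TagPattern, RegexOptions.IgnorePatternWhitespace))
        {
            // Keep only the tags we care about, e.g. "a".
            if (!string.Equals(match.Groups["Tag"].Value, tagName, StringComparison.OrdinalIgnoreCase))
                continue;

            // Assumption: Key and Value captures line up one-to-one per attribute.
            CaptureCollection keys = match.Groups["Key"].Captures;
            CaptureCollection vals = match.Groups["Value"].Captures;
            for (int i = 0; i < keys.Count && i < vals.Count; i++)
            {
                if (string.Equals(keys[i].Value.Trim(), attributeName, StringComparison.OrdinalIgnoreCase))
                    values.Add(vals[i].Value);
            }
        }
        return values;
    }

    public static void Main()
    {
        string html = "<a href=\"~/abc/cde\" rel=\"new\">Link1</a> <a href=\"~/abc/ghq\">Link2</a>";
        foreach (string href in GetAttributeValues(html, "a", "href"))
        {
            Console.WriteLine(href);
        }
    }
}
```

Because the pattern matches any tag that has attributes, the same helper can be pointed at other tag and attribute pairs, for example `GetAttributeValues(html, "a", "rel")`.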
