This is my string. I want to pull out the link from href="pull this out" and the text between the tags, using C# Regex. I'm not sure how to do it.
<a href="http://msdn.microsoft.com/en-us/library/Aa538627.aspx" onclick="trackClick(this, '117', 'http\x3a\x2f\x2fmsdn.microsoft.com\x2fen-us\x2flibrary\x2fAa538627.aspx', '15');">ToolStripItemOwnerCollectionUIAdapter.GetInsertingIndex Method ...</a>
Answer 0: (Score: 4)
Don't use regular expressions to parse HTML (as @hsz mentioned). To see why, read: RegEx match open tags except XHTML self-contained tags. Use an HTML parser such as HtmlAgilityPack instead:
using HtmlAgilityPack;

var html = @"<a href=""http://msdn.microsoft.com/en-us/library/Aa538627.aspx"" onclick=""trackClick(this, '117', 'http\x3a\x2f\x2fmsdn.microsoft.com\x2fen-us\x2flibrary\x2fAa538627.aspx', '15');"">ToolStripItemOwnerCollectionUIAdapter.GetInsertingIndex Method ...</a>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
// Select the first <a> element in the document (null if there is none).
var link = document.DocumentNode.SelectSingleNode("//a");
if (link != null)
{
    // Fall back to an empty string if the anchor has no href attribute.
    var href = link.GetAttributeValue("href", string.Empty);
    // Text between the opening and closing tags.
    var innerText = link.InnerText;
}
Now href contains http://msdn.microsoft.com/en-us/library/Aa538627.aspx, and innerText (a.k.a. the string between the tags) contains ToolStripItemOwnerCollectionUIAdapter.GetInsertingIndex Method ...
Isn't that easier than a regex?
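If the input contains more than one anchor, the same parser-based approach can be extended to collect every link. The following is only a sketch under a few assumptions: the HtmlAgilityPack NuGet package is referenced, and `LinkExtractor` and `Extract` are hypothetical names introduced here for illustration. It selects only anchors that carry an href attribute and decodes HTML entities in the link text.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, not part of HtmlAgilityPack itself.
public static class LinkExtractor
{
    // Returns (href, text) pairs for every <a> element that has an href attribute.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var results = new List<KeyValuePair<string, string>>();

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return results;
        }

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);
            // InnerText drops nested tags; DeEntitize turns "&amp;" into "&".
            string text = HtmlEntity.DeEntitize(anchor.InnerText).Trim();
            results.Add(new KeyValuePair<string, string>(href, text));
        }

        return results;
    }
}
```

Calling `LinkExtractor.Extract(html)` on the string from the question should give a single pair whose key is the MSDN URL and whose value is the link text.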
Answer 1: (Score: 2)
This shows how to do what you're looking for: C# Scraping HTML Links
Here is the code sample from that page:
using System.Collections.Generic;
using System.Text.RegularExpressions;

public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href + "\n\t" + Text;
    }
}

static class LinkFinder
{
    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        // 1.
        // Find all matches in file.
        MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
            RegexOptions.Singleline);

        // 2.
        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3.
            // Get href attribute.
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
            }

            // 4.
            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }
}
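For the single anchor in the question, the multi-pass code above can be condensed into one pattern with named groups. This is a minimal sketch, not the code from the linked page: `SingleLinkDemo` and `TryParse` are hypothetical names, and the pattern assumes a double-quoted href and no `>` character inside any attribute value. It uses `<a\b` so that tags such as `<abbr>` are not mistaken for anchors.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class SingleLinkDemo
{
    // "href" captures the double-quoted attribute value,
    // "text" captures everything up to the first closing </a>.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*""(?<href>[^""]*)""[^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static bool TryParse(string html, out string href, out string text)
    {
        href = string.Empty;
        text = string.Empty;

        Match m = AnchorPattern.Match(html);
        if (!m.Success)
        {
            return false;
        }

        href = m.Groups["href"].Value;
        // Strip any nested tags (for example <b>...</b>) from the link text.
        text = Regex.Replace(m.Groups["text"].Value, @"<[^>]*>", "").Trim();
        return true;
    }
}
```

A call such as `SingleLinkDemo.TryParse(html, out var href, out var text)` returns false when no matching anchor is found. A pattern like this is still fragile on real-world HTML (single-quoted or unquoted attributes, comments, scripts), which is the reason the other answer recommends a parser.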