Regex hrefs = new Regex("<a href.*?>");
Regex http = new Regex("http:.*?>");
StringBuilder sb = new StringBuilder();
WebClient client = new WebClient();
string source = client.DownloadString("http://google.com");
foreach (Match m in hrefs.Matches(source))
{
    sb.Append(http.Match(m.ToString()));
    Console.WriteLine(http.Match(m.ToString()));
}
The code works, but there is one problem. Look at the output:
http://images.google.se/imghp?hl=sv&tab=wi" onclick=gbar.qs(this) class=gb1>
http://video.google.se/?hl=sv&tab=wv" onclick=gbar.qs(this) class=gb1>
http://maps.google.se/maps?hl=sv&tab=wl" onclick=gbar.qs(this) class=gb1>
http://news.google.se/nwshp?hl=sv&tab=wn" onclick=gbar.qs(this) class=gb1>
http://translate.google.se/?hl=sv&tab=wT" onclick=gbar.qs(this) class=gb1>
http://mail.google.com/mail/?hl=sv&tab=wm" class=gb1>
http://www.google.se/intl/sv/options/" onclick="this.blur();gbar.tg(event);return !1" aria-haspopup=true class=gb3>
http://blogsearch.google.se/?hl=sv&tab=wb" onclick=gbar.qs(this) class=gb2>
http://www.youtube.com/?hl=sv&tab=w1&gl=SE" onclick=gbar.qs(this) class=gb2>
http://www.google.com/calendar/render?hl=sv&tab=wc" class=gb2>
http://picasaweb.google.se/home?hl=sv&tab=wq" onclick=gbar.qs(this) class=gb2>
http://docs.google.com/?hl=sv&tab=wo" class=gb2>
http://www.google.se/reader/view/?hl=sv&tab=wy" class=gb2>
http://sites.google.com/?hl=sv&tab=w3" class=gb2>
http://groups.google.se/grphp?hl=sv&tab=wg" onclick=gbar.qs(this) class=gb2>
http://www.google.se/ig%3Fhl%3Dsv%26source%3Diglk&usg=AFQjCNEsLWK4azJkUc3KrW46JTUSjK4vhA" class=gb4>
http://www.google.se/" class=gb4>
http://www.google.com/intl/sv/landing/games10/index.html">
http://www.google.com/ncr">
How do I remove the trailing HTML attributes so that only the URL remains?
Answer 0 (score: 11)
Change the regular expression to:
Regex http = new Regex("http:.*?\"");
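Or even better, use HtmlAgilityPack and XPath to parse all the links:
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//a[@href]"); // Finds every <a> element that has an href attribute
foreach (var node in nodes)
{
    // Print the link target (the href value) rather than the link's inner markup
    Console.WriteLine(node.GetAttributeValue("href", ""));
}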
Answer 1 (score: 2)
The quick fix is to change this:
Regex http = new Regex("http:.*?>");
to this:
Regex http = new Regex("http:.*?\"");
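A better solution is to use a library to parse the HTML. HTML Agility Pack can be used for this and will make your life easier.
Answer 2 (score: 0)
Cut the result of http.Match(m.ToString()) off at the first double quote: http.Match(m.ToString()).Value.Remove(http.Match(m.ToString()).Value.IndexOf("\""))
It is not the cleanest way, but it works as long as the href value is quoted.
Answer 3 (score: 0)
Change the terminating character in the regex from > to ".
Answer 4 (score: 0)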
Regex hrefs = new Regex("<a href.*?>");
Regex http = new Regex("(http:.*?)\"");
StringBuilder sb = new StringBuilder();
WebClient client = new WebClient();
string source = client.DownloadString("http://google.com");
foreach (Match m in hrefs.Matches(source))
{
    var value = http.Match(m.ToString()).Groups[1].Value;
    sb.Append(value);
    Console.WriteLine(value);
}
Answer 5 (score: 0)
A simple and straightforward solution: match http: followed by any characters other than the " character.
"http:[^\"]*"
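As a rough illustration of the negated character class above, the sketch below applies it to a single anchor tag modeled on the output in the question. The class name LinkDemo and the helper ExtractUrl are made up for this example and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkDemo
{
    // "http:" followed by every character up to, but not including, the next double quote.
    private static readonly Regex Http = new Regex("http:[^\"]*");

    // Hypothetical helper: returns the matched URL, or an empty string if nothing matches.
    public static string ExtractUrl(string anchorTag)
    {
        return Http.Match(anchorTag).Value;
    }

    public static void Main()
    {
        // Sample tag modeled on the output shown in the question.
        string tag = "<a href=\"http://maps.google.se/maps?hl=sv&tab=wl\" onclick=gbar.qs(this) class=gb1>";
        Console.WriteLine(ExtractUrl(tag)); // prints http://maps.google.se/maps?hl=sv&tab=wl
    }
}
```

Because the closing quote is excluded by the character class itself, no capture group or trimming step is needed. The pattern still assumes the href value is wrapped in double quotes.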
Answer 6 (score: 0)
FilesAndImages:<\s*(?<Tag>(applet|embed|frame|img|link|script|xml))\s*.*?(?<AttributeName>(src|href|xhref))\s*=\s*[\"\'](?<FileOrImage>.*?)[\"\']
HyperLinks:<\s*(?<Tag>(a|form|frame))\s*.*?(?<AttributeName>(action|href|src))\s*=\s*[\"\'](?<HyperLink>.*?)[\"\']
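One possible way to use the HyperLinks pattern is to read the named group HyperLink from each match, as sketched below. The class name HyperLinkDemo, the helper ExtractLinks, the sample HTML, and the choice of RegexOptions.IgnoreCase are assumptions made for this example. The pattern is written as a verbatim string, so the escaped quotes in the character classes appear as ""' instead.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HyperLinkDemo
{
    // The HyperLinks pattern from the answer, written as a C# verbatim string.
    private static readonly Regex HyperLinks = new Regex(
        @"<\s*(?<Tag>(a|form|frame))\s*.*?(?<AttributeName>(action|href|src))\s*=\s*[""'](?<HyperLink>.*?)[""']",
        RegexOptions.IgnoreCase); // IgnoreCase is an assumption, so that <A HREF=...> also matches

    // Hypothetical helper: collects the value of the HyperLink group from every match.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HyperLinks.Matches(html))
        {
            links.Add(m.Groups["HyperLink"].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Made-up sample input with two quoted links.
        string html = "<a href=\"http://www.google.com/ncr\">x</a> <a href=\"http://www.google.se/\" class=gb4>y</a>";
        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

The Tag and AttributeName groups can be read the same way through m.Groups["Tag"] and m.Groups["AttributeName"] if the element or attribute name matters.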