What I want is to open the links found in a website's HtmlContent and get the Html of each newly opened site.
Example: I have www.google.com and now I want to find all the links on it. For each link, I want to get the HTMLContent of the new site.
I do it like this:
foreach (String link in GetLinksFromWebsite(htmlContent))
{
    using (var client = new WebClient())
    {
        htmlContent = client.DownloadString("http://" + link);
    }

    // istBildURL and bildLinks are defined elsewhere in my code
    // (an image-URL regex MatchCollection and a List<string>)
    foreach (Match treffer in istBildURL)
    {
        string bildUrl = treffer.Groups[1].Value;
        bildLinks.Add(bildUrl);
    }
}
public static List<String> GetLinksFromWebsite(string htmlSource)
{
    string linkPattern = "<a href=\"(.*?)\">(.*?)</a>";
    MatchCollection linkMatches = Regex.Matches(htmlSource, linkPattern, RegexOptions.Singleline);
    List<string> linkContents = new List<string>();
    foreach (Match match in linkMatches)
    {
        linkContents.Add(match.Value);
    }
    return linkContents;
}
Another problem is that I only get plain links, not link buttons (ASP.NET). How can I solve that?
Answer 0 (score: 7)
Steps to follow:
Throw away everything that starts with regex or regular expression and deals with parsing HTML (read this answer to better understand why). In your case, that would be the contents of the GetLinksFromWebsite method.
Use an HTML parser such as the Html Agility Pack instead. Here's an example:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            var htmlSource = client.DownloadString("http://www.stackoverflow.com");
            foreach (var item in GetLinksFromWebsite(htmlSource))
            {
                // TODO: you could easily write a recursive function
                // that will call itself here and retrieve the respective contents
                // of the site ...
                Console.WriteLine(item);
            }
        }
    }

    public static List<String> GetLinksFromWebsite(string htmlSource)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);
        return doc
            .DocumentNode
            .SelectNodes("//a[@href]")
            .Select(node => node.Attributes["href"].Value)
            .ToList();
    }
}
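As the TODO hints, the recursive part is straightforward to add. Below is a minimal sketch of what that could look like; the Crawl method, the depth limit, the visited set, and the null check on SelectNodes are my additions, not part of the answer above. It downloads each page with WebClient, extracts its links with the Html Agility Pack, and only follows absolute http(s) links.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class Crawler
{
    // Remember which pages were already visited so the recursion cannot loop forever.
    private static readonly HashSet<string> visited =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    static void Main()
    {
        Crawl("http://www.stackoverflow.com", depth: 1);
    }

    // Hypothetical helper (not from the answer above): downloads the page at 'url'
    // and recurses into its links until 'depth' reaches zero.
    public static void Crawl(string url, int depth)
    {
        if (depth < 0 || !visited.Add(url))
            return;

        string htmlSource;
        using (var client = new WebClient())
        {
            try
            {
                htmlSource = client.DownloadString(url);
            }
            catch (WebException)
            {
                return; // skip unreachable pages
            }
        }

        Console.WriteLine("Downloaded {0} ({1} characters)", url, htmlSource.Length);

        foreach (var link in GetLinksFromWebsite(htmlSource))
        {
            // This sketch only follows absolute http(s) links;
            // relative links would have to be resolved against 'url' first.
            if (link.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                link.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
            {
                Crawl(link, depth - 1);
            }
        }
    }

    public static List<string> GetLinksFromWebsite(string htmlSource)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);

        // SelectNodes returns null when the page contains no matching <a href> elements.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return new List<string>();

        return anchors
            .Select(node => node.Attributes["href"].Value)
            .ToList();
    }
}

Note the null check before Select: SelectNodes returns null rather than an empty collection when nothing matches, which is easy to trip over when crawling arbitrary pages.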