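I'm trying to collect a list of links from a website. Starting from the root, the site can branch into many subdirectory links. Below is a link to a simplified diagram that illustrates the structure. I only care about getting the green links; the yellow links always lead to other links, so my output array would contain A, B, D, F, G, H, I. I'm trying to write this in C#.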
Answer 0 (score: 1)
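In general, you can do something like this:

    // Requires: using System; using System.Collections.Generic; using System.Linq;
    private static IEnumerable<T> Leaves<T>(T root, Func<T, IEnumerable<T>> childSource)
    {
        var children = childSource(root).ToList();
        if (!children.Any())
        {
            // No children: this node is a leaf.
            yield return root;
            yield break;
        }
        // Otherwise recurse into each child and pass its leaves along.
        foreach (var descendant in children.SelectMany(child => Leaves(child, childSource)))
        {
            yield return descendant;
        }
    }
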
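This assumes childSource is a function that takes an element and returns that element's children. In your case, you'll want to write a function that uses something like HtmlAgilityPack to take a given URL, download it, and return the links found in it.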
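
To see what Leaves does before any HTTP is involved, here is a small sketch that runs it over an in-memory tree. The site map is hypothetical (the diagram isn't reproduced here): I'm assuming "root", "C" and "E" play the role of the yellow links, and the names LeavesDemo, Site, ChildrenOf and Run are made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LeavesDemo
{
    // Same idea as Leaves above: yield only the nodes that have no children.
    public static IEnumerable<T> Leaves<T>(T root, Func<T, IEnumerable<T>> childSource)
    {
        var children = childSource(root).ToList();
        if (children.Count == 0)
        {
            yield return root;
            yield break;
        }
        foreach (var leaf in children.SelectMany(child => Leaves(child, childSource)))
        {
            yield return leaf;
        }
    }

    // Hypothetical site map; "root", "C" and "E" stand in for the yellow links.
    private static readonly Dictionary<string, string[]> Site = new Dictionary<string, string[]>
    {
        ["root"] = new[] { "A", "B", "C" },
        ["C"] = new[] { "D", "E" },
        ["E"] = new[] { "F", "G", "H", "I" },
    };

    // Child source backed by the dictionary; nodes without an entry have no children.
    public static IEnumerable<string> ChildrenOf(string node)
    {
        return Site.TryGetValue(node, out var kids) ? kids : Enumerable.Empty<string>();
    }

    // Walks the hypothetical site from "root" and collects its leaves.
    public static List<string> Run()
    {
        return Leaves("root", ChildrenOf).ToList();
    }
}
```

Calling LeavesDemo.Run() on this tree gives A, B, D, F, G, H, I, which is the output the question asks for. Swapping ChildrenOf for a function that downloads a page and returns its links turns the same traversal into a crawler.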

    // Requires: using System.IO; using System.Net;
    private static string Get(int msBetweenRequests, string url)
    {
        try
        {
            var webRequest = WebRequest.CreateHttp(url);
            using (var webResponse = webRequest.GetResponse())
            using (var responseStream = webResponse.GetResponseStream())
            using (var responseStreamReader = new StreamReader(responseStream, System.Text.Encoding.UTF8))
            {
                var result = responseStreamReader.ReadToEnd();
                return result;
            }
        }
        catch
        {
            return null; // really nothing sensible to do here
        }
        finally
        {
            // let's be nice to the server we're crawling
            System.Threading.Thread.Sleep(msBetweenRequests);
        }
    }

    private static IEnumerable<string> ScrapeForLinks(string url)
    {
        var noResults = Enumerable.Empty<string>();
        var html = Get(1000, url);
        if (string.IsNullOrWhiteSpace(html)) return noResults;
        var d = new HtmlAgilityPack.HtmlDocument();
        d.LoadHtml(html);
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = d.DocumentNode.SelectNodes("//a[@href]");
        return links == null
            ? noResults
            : links
                .Select(link => link.GetAttributeValue("href", string.Empty))
                .Select(linkUrl => FixRelativePaths(url, linkUrl));
    }

    // Resolves a (possibly relative) href against the URL of the page it was found on.
    private static string FixRelativePaths(string baseUrl, string relativeUrl)
    {
        var combined = new Uri(new Uri(baseUrl), relativeUrl);
        return combined.ToString();
    }

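
As an optional refinement of FixRelativePaths above, here is a sketch of a more defensive resolver. It is my own variation, not something the crawler requires: the name LinkResolver.Resolve is made up, and the example.com URLs in the usage note are placeholders. It uses Uri.TryCreate so a malformed href yields null rather than an exception, and it drops the #fragment so that "page.html" and "page.html#top" are treated as the same page.

```csharp
using System;

public static class LinkResolver
{
    // Resolves href against the page URL and drops any #fragment.
    // Returns null when the pair cannot be combined into an absolute URL.
    public static string Resolve(string pageUrl, string href)
    {
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out var baseUri)) return null;
        if (!Uri.TryCreate(baseUri, href, out var combined)) return null;
        // UriPartial.Query keeps scheme, authority, path and query, but not the fragment.
        return combined.GetLeftPart(UriPartial.Query);
    }
}
```

For example, resolving "c.html" against "http://example.com/a/b.html" gives "http://example.com/a/c.html", and resolving "#top" against the same page gives the page URL itself. If you use something like this, the caller has to filter out the null results.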
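Note that if the links between these pages contain any cycles, a naive approach will loop forever. To mitigate that, you'll want to avoid expanding the children of URLs you have already visited.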

    private static Func<string, IEnumerable<string>> DontVisitMoreThanOnce(Func<string, IEnumerable<string>> naiveChildSource)
    {
        var alreadyVisited = new HashSet<string>();
        return s =>
        {
            // Distinct() stops a page that links to the same URL twice from expanding it twice.
            var children = naiveChildSource(s).Select(RemoveTrailingSlash).Distinct().ToList();
            var filteredChildren = children.Where(c => !alreadyVisited.Contains(c)).ToList();
            alreadyVisited.UnionWith(children);
            return filteredChildren;
        };
    }

    // Treats "http://host/path/" and "http://host/path" as the same URL.
    private static string RemoveTrailingSlash(string url)
    {
        return url.TrimEnd('/');
    }

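
One thing to watch with the wrapper above: it removes already-visited children before Leaves gets to see them, so a page whose links have all been seen earlier looks like it has no children and gets reported as a leaf. If that matters to you, one alternative (my own sketch, not part of the approach above) is to keep the visited set inside the traversal, so that "is this a leaf?" is decided on the unfiltered children. The names VisitOnceDemo, LeavesVisitingOnce, Graph and Run are made up, and the graph is a hypothetical one where X links back to root.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VisitOnceDemo
{
    // Depth-first traversal that expands each node at most once.
    // A node counts as a leaf only if its child source returns nothing at all.
    public static IEnumerable<T> LeavesVisitingOnce<T>(T root, Func<T, IEnumerable<T>> childSource)
    {
        var visited = new HashSet<T> { root };
        var pending = new Stack<T>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            var node = pending.Pop();
            var children = childSource(node).ToList();
            if (children.Count == 0)
            {
                yield return node;
                continue;
            }
            // Push in reverse so children are popped in their original order.
            for (var i = children.Count - 1; i >= 0; i--)
            {
                if (visited.Add(children[i])) pending.Push(children[i]);
            }
        }
    }

    // Hypothetical graph with a cycle: X links back to root.
    private static readonly Dictionary<string, string[]> Graph = new Dictionary<string, string[]>
    {
        ["root"] = new[] { "A", "X" },
        ["X"] = new[] { "root", "B" },
    };

    // Walks the hypothetical graph from "root" and collects its leaves.
    public static List<string> Run()
    {
        return LeavesVisitingOnce(
            "root",
            n => Graph.TryGetValue(n, out var kids) ? kids : new string[0]).ToList();
    }
}
```

On this graph the traversal terminates and yields A and B; root is never reported as a leaf even though X links back to it. The trade-off is that the visited-set logic is no longer a separate, composable wrapper.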
If you want to stop your crawler from escaping onto the wider internet and spending its time on YouTube, you'll want
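
    private static Func<string, IEnumerable<string>> DontLeaveTheDomain(
        string domain,
        Func<string, IEnumerable<string>> wanderer)
    {
        // Ordinal comparison: URLs should not be compared with culture-sensitive rules.
        return u => wanderer(u).Where(l => l.StartsWith(domain, StringComparison.Ordinal));
    }
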
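
A prefix check is simple, but a prefix such as "http://example.com" also matches "http://example.com.evil.org". If that is a concern, a possible variation is to compare the parsed host instead. This is only a sketch: DomainFilterDemo, IsOnHost and StayOnHost are names I made up, and example.com is a placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DomainFilterDemo
{
    // True when link is an absolute URL whose host equals the given host.
    public static bool IsOnHost(string host, string link)
    {
        return Uri.TryCreate(link, UriKind.Absolute, out var uri)
            && string.Equals(uri.Host, host, StringComparison.OrdinalIgnoreCase);
    }

    // Same shape as DontLeaveTheDomain above, but filtering on the host name.
    public static Func<string, IEnumerable<string>> StayOnHost(
        string host,
        Func<string, IEnumerable<string>> wanderer)
    {
        return u => wanderer(u).Where(l => IsOnHost(host, l));
    }
}
```

Note that this matches the exact host only, so "www.example.com" and "example.com" count as different sites unless you normalize them yourself.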
Once you have defined these things, all you need is

    var results = Leaves(
            myUrl,
            DontLeaveTheDomain(
                myDomain,
                DontVisitMoreThanOnce(ScrapeForLinks)))
        .Distinct()
        .ToList();