Best way to iterate over a tree structure of links in C#

Date: 2015-07-07 18:48:39

Tags: c#

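I am trying to collect a list of links from a website. Starting from the root, the site can branch out into many subdirectory links. Below is a simplified diagram that illustrates the structure. I only care about getting the green links; the yellow links always lead to other links, so my output array would contain A, B, D, F, G, H, I. I am trying to write this in C#.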

[Image: simplified diagram of the link structure]

1 answer:

Answer 0 (score: 1)

In general, you can do something like this:

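    // Needs: using System; using System.Collections.Generic; using System.Linq;
    // Lazily yields every leaf reachable from root, i.e. every node for which
    // childSource returns no children.
    private static IEnumerable<T> Leaves<T>(T root, Func<T, IEnumerable<T>> childSource)
    {
        var children = childSource(root).ToList();
        if (!children.Any())
        {
            // No children: root itself is a leaf.
            yield return root;
            yield break;
        }
        // Otherwise recurse into each child and pass its leaves through.
        foreach (var descendant in children.SelectMany(child => Leaves(child, childSource)))
        {
            yield return descendant;
        }
    }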

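This assumes that childSource is a function that takes an element and returns that element's children. In your case, you'll want to create a function that uses something like HtmlAgilityPack to take a given URL, download it, and return the links found in it.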
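To see the shape of childSource without touching the network, here is a minimal sketch that runs Leaves over an in-memory tree. The tree itself is hypothetical (the original diagram is not reproduced here): "Root", "C" and "E" stand in for yellow links, and the names `LeafDemo`, `Tree` and `ChildrenOf` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LeafDemo
{
    // Same logic as Leaves in the answer, repeated so this sketch compiles on its own.
    public static IEnumerable<T> Leaves<T>(T root, Func<T, IEnumerable<T>> childSource)
    {
        var children = childSource(root).ToList();
        if (!children.Any())
        {
            yield return root;
            yield break;
        }
        foreach (var descendant in children.SelectMany(child => Leaves(child, childSource)))
        {
            yield return descendant;
        }
    }

    // Hypothetical tree: keys are the "yellow" nodes, everything else is a "green" leaf.
    public static readonly Dictionary<string, string[]> Tree = new Dictionary<string, string[]>
    {
        { "Root", new[] { "A", "B", "C" } },
        { "C", new[] { "D", "E" } },
        { "E", new[] { "F", "G", "H", "I" } },
    };

    // The childSource: a node without an entry in Tree has no children.
    public static IEnumerable<string> ChildrenOf(string node)
    {
        string[] children;
        return Tree.TryGetValue(node, out children) ? children : Enumerable.Empty<string>();
    }

    public static void Main()
    {
        // Prints: A,B,D,F,G,H,I
        Console.WriteLine(string.Join(",", Leaves("Root", ChildrenOf)));
    }
}
```

Swapping `ChildrenOf` for a function that downloads a page and returns its links is the only change needed to crawl a real site.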

    private static string Get(int msBetweenRequests, string url)
    {
        try
        {
            var webRequest = WebRequest.CreateHttp(url);
            using (var webResponse = webRequest.GetResponse())
            using (var responseStream = webResponse.GetResponseStream())
            using (var responseStreamReader = new StreamReader(responseStream, System.Text.Encoding.UTF8))
            {
                var result = responseStreamReader.ReadToEnd();
                return result;
            }
        }
        catch
        {
            return null; // really nothing sensible to do here
        }
        finally
        {
            // let's be nice to the server we're crawling
            System.Threading.Thread.Sleep(msBetweenRequests);
        }
    }


    private static IEnumerable<string> ScrapeForLinks(string url)
    {
        var noResults = Enumerable.Empty<string>();

        var html = Get(1000, url);
        if (string.IsNullOrWhiteSpace(html)) return noResults;

        var d = new HtmlAgilityPack.HtmlDocument();
        d.LoadHtml(html);
        var links = d.DocumentNode.SelectNodes("//a[@href]");
        return links == null
            ? noResults
            : links
                .Select(link => link.Attributes
                    .Where(a => a.Name.ToLower() == "href")
                    .Select(a => a.Value)
                    .First())
                .Select(linkUrl => FixRelativePaths(url, linkUrl));
    }

    private static string FixRelativePaths(string baseUrl, string relativeUrl)
    {
        var combined = new Uri(new Uri(baseUrl), relativeUrl);
        return combined.ToString();
    }

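Note that if there are any cycles in the links between these pages, a naive approach will loop forever. To mitigate this, you'll want to avoid expanding the children of URLs you have already visited.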

    private static Func<string, IEnumerable<string>> DontVisitMoreThanOnce(Func<string, IEnumerable<string>> naiveChildSource)
    {
        var alreadyVisited = new HashSet<string>();
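        return s =>
        {
            // Mark the page being expanded as visited too, so a link back to the
            // start URL is not followed a second time.
            alreadyVisited.Add(RemoveTrailingSlash(s));
            var children = naiveChildSource(s).Select(RemoveTrailingSlash).ToList();
            var filteredChildren = children.Where(c => !alreadyVisited.Contains(c)).ToList();
            alreadyVisited.UnionWith(children);
            return filteredChildren;
        };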
    }

    private static string RemoveTrailingSlash(string url)
    {
        return url.TrimEnd(new[] {'/'});
    }
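Since the question asks about iterating, here is a hedged sketch of the same idea without recursion: an explicit stack plus a visited set. This is an alternative, not the code from the answer; `LeavesIterative`, `Links` and `LinksFrom` are hypothetical names, and the small cyclic graph is invented for the demo. One difference worth noting, as far as I can tell: this version reports a page as a leaf only when it has no links at all, whereas filtering the children can make a page whose links were all seen before look like a leaf.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IterativeLeafDemo
{
    // Depth-first traversal with an explicit stack; each node is expanded at most once.
    public static IEnumerable<T> LeavesIterative<T>(T root, Func<T, IEnumerable<T>> childSource)
    {
        var visited = new HashSet<T> { root };
        var pending = new Stack<T>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            var current = pending.Pop();
            var children = childSource(current).ToList();
            if (children.Count == 0)
            {
                yield return current; // no outgoing links at all: a leaf
                continue;
            }
            // Push in reverse so children are expanded in their original order.
            foreach (var child in Enumerable.Reverse(children))
            {
                if (visited.Add(child)) pending.Push(child);
            }
        }
    }

    // Hypothetical graph with two cycles (C -> Root and E -> C).
    public static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        { "Root", new[] { "A", "C" } },
        { "C", new[] { "D", "Root", "E" } },
        { "E", new[] { "F", "C" } },
    };

    public static IEnumerable<string> LinksFrom(string page)
    {
        string[] links;
        return Links.TryGetValue(page, out links) ? links : Enumerable.Empty<string>();
    }

    public static void Main()
    {
        // Prints: A,D,F
        Console.WriteLine(string.Join(",", LeavesIterative("Root", LinksFrom)));
    }
}
```

An explicit stack also avoids deep recursion on sites with long link chains.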

If you want to keep your crawler from escaping onto the wider internet and spending its time on YouTube, you'll want:

    private static Func<string, IEnumerable<string>> DontLeaveTheDomain(
        string domain,
        Func<string, IEnumerable<string>> wanderer)
    {
        return u => wanderer(u).Where(l => l.StartsWith(domain));
    }
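A prefix check like StartsWith also accepts URLs such as `http://example.com.evil.org/`, because that string begins with `http://example.com`. If that matters for your site, a possible variation is to compare the parsed host instead. This is a sketch under that assumption; `DontLeaveTheHost`, `IsOnHost` and the example URLs are hypothetical and not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DomainFilterDemo
{
    // True only when url is an absolute URI whose host equals the expected host.
    public static bool IsOnHost(string host, string url)
    {
        Uri parsed;
        return Uri.TryCreate(url, UriKind.Absolute, out parsed)
            && string.Equals(parsed.Host, host, StringComparison.OrdinalIgnoreCase);
    }

    // Same wrapper shape as DontLeaveTheDomain, but filtering on the host name.
    public static Func<string, IEnumerable<string>> DontLeaveTheHost(
        string host,
        Func<string, IEnumerable<string>> wanderer)
    {
        return u => wanderer(u).Where(l => IsOnHost(host, l));
    }

    public static void Main()
    {
        // Stand-in for ScrapeForLinks: always returns the same three links.
        Func<string, IEnumerable<string>> fakeLinks = _ => new[]
        {
            "http://example.com/a",
            "http://example.com.evil.org/b",
            "https://www.youtube.com/watch",
        };

        var filtered = DontLeaveTheHost("example.com", fakeLinks);

        // Prints: http://example.com/a
        Console.WriteLine(string.Join(",", filtered("http://example.com/")));
    }
}
```

Note that this treats `www.example.com` and `example.com` as different hosts, which may or may not be what you want.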

Once you have defined these pieces, all you need is:

    var results = Leaves(
        myUrl,
        DontLeaveTheDomain(
            myDomain, 
            DontVisitMoreThanOnce(ScrapeForLinks)))
        .Distinct()
        .ToList();