Scrape all images from a Pinterest board

Time: 2017-10-10 21:43:41

Tags: dom web-scraping html-agility-pack

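I am using the HTML Agility Pack to scrape all the images from one of my Pinterest boards. My code returns only 25 results when there should be many more items. How can I scrape all the image tags from the board?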

I use a WebBrowser control to load the DOM, so that I can wait for the page before scraping it:

    private void LoadHtmlWithBrowser(String url, string dir)
    {
        webBrowser1.ScriptErrorsSuppressed = true;
        webBrowser1.Navigate(url);

        waitTillLoad(this.webBrowser1);

        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)webBrowser1.Document.DomDocument;
        StringReader sr = new StringReader(documentAsIHtmlDocument3.documentElement.outerHTML);
        doc.Load(sr);

        Scraper.ScrapeBoard(doc, dir);
    }

The DOM is then passed to this function, which iterates over all image tags:

    public static bool ScrapeBoard(HtmlDocument document, string dir)
    {
        //var document = new HtmlWeb().Load(url);
        var urls = document.DocumentNode.Descendants("img")
                                        .Select(e => e.GetAttributeValue("src", null))
                                        .Where(s => !String.IsNullOrEmpty(s));

        //string dir = DateTime.Now.ToShortDateString().Replace("/", "_") + url.Replace("https://www.", "_");
        Directory.CreateDirectory(dir);

        string localFilename = "";
        foreach (string s in urls)
        {
            try
            {
                localFilename = dir + "/" + Path.GetFileName(s);
                using (WebClient client = new WebClient())
                {
                    client.DownloadFile(s, localFilename);
                }
            }
            catch (Exception)
            {
                // Note: the first failed download aborts the whole run.
                return false;
            }
        }
        return true;
    }

A function that makes sure the whole page has loaded before continuing:

    private void waitTillLoad(WebBrowser webBrControl)
    {
        WebBrowserReadyState loadStatus;
        int waittime = 100000;
        int counter = 0;
        while (true)
        {
            loadStatus = webBrControl.ReadyState;
            Application.DoEvents();
            if ((counter > waittime) || (loadStatus == WebBrowserReadyState.Uninitialized) || (loadStatus == WebBrowserReadyState.Loading) || (loadStatus == WebBrowserReadyState.Interactive))
            {
                break;
            }
            counter++;
        }

        counter = 0;
        while (true)
        {
            loadStatus = webBrControl.ReadyState;
            Application.DoEvents();
            if (loadStatus == WebBrowserReadyState.Complete && webBrControl.IsBusy != true)
            {
                break;
            }
            counter++;
        }
    }

When I inspect the returned DOM (the StringReader sr), it contains only 25 image tags. Why isn't the rest extracted or loaded with the technique above?

1 Answer:

Answer 0: (score: 0)

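For the page to load fully, you must be logged in to your Pinterest account. In addition, on most boards you have to scroll down, because more images are loaded only as you scroll.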
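
The scrolling part can be sketched roughly as follows. This is only a minimal sketch, not a tested Pinterest scraper: `LazyLoadHelper`, `CollectWhileScrolling` and its parameters are hypothetical names made up for illustration, and the three delegates stand in for "read the image URLs currently in the DOM", "scroll to the bottom" and "give the page time to load". It keeps scrolling until several rounds in a row bring no new URLs, and it accumulates URLs across rounds in case the page drops off-screen images from the DOM.

```csharp
using System;
using System.Collections.Generic;

public static class LazyLoadHelper
{
    // Hypothetical helper: scrolls repeatedly and collects every distinct URL
    // seen along the way. Stops after stableRounds consecutive rounds without
    // a new URL, or after maxRounds rounds as a safety limit.
    public static HashSet<string> CollectWhileScrolling(
        Func<IEnumerable<string>> readUrls, // URLs currently present in the DOM
        Action scrollDown,                  // scrolls the page to the bottom
        Action wait,                        // lets the page load more items
        int maxRounds = 50,
        int stableRounds = 3)
    {
        var seen = new HashSet<string>(readUrls());
        int unchanged = 0;

        for (int round = 0; round < maxRounds && unchanged < stableRounds; round++)
        {
            scrollDown();
            wait();

            int before = seen.Count;
            seen.UnionWith(readUrls());

            // Reset the counter whenever this round produced something new.
            unchanged = (seen.Count == before) ? unchanged + 1 : 0;
        }

        return seen;
    }
}

// Possible wiring for the WebBrowser control from the question (an assumption,
// needs System.Linq and System.Windows.Forms):
//
//   var urls = LazyLoadHelper.CollectWhileScrolling(
//       () => webBrowser1.Document.GetElementsByTagName("img")
//                 .Cast<HtmlElement>()
//                 .Select(e => e.GetAttribute("src")),
//       () => webBrowser1.Document.Window.ScrollTo(
//                 0, webBrowser1.Document.Body.ScrollRectangle.Height),
//       () => waitTillLoad(webBrowser1));
```

Since `ReadyState` normally stays `Complete` while a page loads more items through script, `waitTillLoad` may return almost immediately here. A short pause that keeps pumping messages between rounds is probably needed instead, and the right length depends on the connection.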