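I'm using HtmlAgilityPack to scrape all the images from one of my Pinterest boards. My code returns only 25 results when there should be many more. How can I scrape all the image tags from the board?
I load the DOM with a WebBrowser control so that I can wait for the page before scraping it: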
private void LoadHtmlWithBrowser(String url, string dir)
{
    webBrowser1.ScriptErrorsSuppressed = true;
    webBrowser1.Navigate(url);
    waitTillLoad(this.webBrowser1);

    // Read the live DOM (after scripts have run) instead of the raw HTTP response.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)webBrowser1.Document.DomDocument;
    StringReader sr = new StringReader(documentAsIHtmlDocument3.documentElement.outerHTML);
    doc.Load(sr);
    Scraper.ScrapeBoard(doc, dir);
}
The DOM is passed to this function, which iterates over all the image tags:
public static bool ScrapeBoard(HtmlDocument document, string dir)
{
    //var document = new HtmlWeb().Load(url);
    // Collect the non-empty src attribute of every <img> element.
    var urls = document.DocumentNode.Descendants("img")
        .Select(e => e.GetAttributeValue("src", null))
        .Where(s => !String.IsNullOrEmpty(s));
    //string dir = DateTime.Now.ToShortDateString().Replace("/", "_") + url.Replace("https://www.", "_");
    Directory.CreateDirectory(dir);
    string localFilename = "";
    foreach (string s in urls)
    {
        try
        {
            localFilename = Path.Combine(dir, Path.GetFileName(s));
            using (WebClient client = new WebClient())
            {
                client.DownloadFile(s, localFilename);
            }
        }
        catch (Exception)
        {
            // Stop at the first failed download and report failure.
            return false;
        }
    }
    return true;
}
This function makes sure the whole page has loaded before continuing:
private void waitTillLoad(WebBrowser webBrControl)
{
    WebBrowserReadyState loadStatus;
    int waittime = 100000;
    int counter = 0;
    while (true)
    {
        loadStatus = webBrControl.ReadyState;
        Application.DoEvents();
        if ((counter > waittime)
            || (loadStatus == WebBrowserReadyState.Uninitialized)
            || (loadStatus == WebBrowserReadyState.Loading)
            || (loadStatus == WebBrowserReadyState.Interactive))
        {
            break;
        }
        counter++;
    }
    counter = 0;
    while (true)
    {
        loadStatus = webBrControl.ReadyState;
        Application.DoEvents();
        if (loadStatus == WebBrowserReadyState.Complete && webBrControl.IsBusy != true)
        {
            break;
        }
        counter++;
    }
}
When I inspect the returned DOM (the StringReader sr), it contains only 25 image tags. Why aren't the rest extracted or loaded with the technique above?
Answer 0 (score: 0)
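For the page to load completely, you have to be logged in to your Pinterest account. In addition, for most boards you have to scroll down, because the page only loads more images as you scroll.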
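The scrolling step could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `BoardScroller`, the method `ScrollUntilNoNewImages`, and the `maxRounds` and `pauseMs` parameters are hypothetical names that do not appear in the question, and it has not been verified against the live Pinterest page. It also assumes the browser session is already logged in and that the page keeps previously loaded `img` elements in the DOM while new ones are appended.

```csharp
using System.Diagnostics;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper, not part of the question's code.
public static class BoardScroller
{
    // Keeps the UI message loop running for about the given time so that
    // page scripts triggered by the scroll get a chance to load more pins.
    private static void Pump(int milliseconds)
    {
        Stopwatch timer = Stopwatch.StartNew();
        while (timer.ElapsedMilliseconds < milliseconds)
        {
            Application.DoEvents();
            Thread.Sleep(10);
        }
    }

    // Scrolls to the bottom repeatedly until the number of <img> elements
    // stops growing or maxRounds is reached. Returns the last image count seen.
    public static int ScrollUntilNoNewImages(WebBrowser browser, int maxRounds = 50, int pauseMs = 2000)
    {
        int previousCount = -1;
        for (int round = 0; round < maxRounds; round++)
        {
            // This is System.Windows.Forms.HtmlDocument, not HtmlAgilityPack's.
            HtmlDocument page = browser.Document;
            if (page == null || page.Body == null)
            {
                break;
            }

            int currentCount = page.GetElementsByTagName("img").Count;
            if (currentCount == previousCount)
            {
                break; // Nothing new appeared after the last scroll.
            }
            previousCount = currentCount;

            // Jump to the current bottom of the page, then wait for new content.
            page.Window.ScrollTo(0, page.Body.ScrollRectangle.Height);
            Pump(pauseMs);
        }
        return previousCount < 0 ? 0 : previousCount;
    }
}
```

In the question's LoadHtmlWithBrowser, a call such as `BoardScroller.ScrollUntilNoNewImages(webBrowser1);` would go between `waitTillLoad(this.webBrowser1);` and the line that reads `outerHTML`, so that the DOM is captured only after the extra images have had a chance to load. The fixed pause is a simplification; a slow connection may need a longer `pauseMs`.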