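I'm working on a project that involves scraping some product data from a vendor's web site (with their blessing, but not their help). I work in a C# shop, so I'm using the .NET Windows Forms WebBrowser control.
I'm responding to the document completed event, but I'm finding that I have to sleep the thread for a bit, or else the data doesn't show up where I expect it to be in the DOM.
Looking at the JavaScript on the page, I can see that it dynamically changes existing DOM content (setting someDomElement.innerHTML) after the page has finished loading. It isn't making any AJAX calls; it uses data that was already present in the original page load. (I could try to parse that data, but it's embedded in JavaScript and somewhat obfuscated.) So evidently I'm somehow getting the document completed event before the JavaScript has finished running.
There may eventually be a lot of pages to scrape, so waiting half a second or so on each one is far from ideal. I'd like to wait only until all the JavaScript that starts on document ready / page load has finished running before I examine the page. Does anyone know of a way to do that?
I would have thought the document completed event wouldn't fire until then, right? But it definitely does. Maybe some of the page's JavaScript is using setTimeout. Is there a way to tell whether there are any pending timeouts?
Thanks for your help!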
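Answer 0: (score: 2)
You could
Answer 1: (score: 1)
For posterity, or anyone else looking at this: what I ended up doing was writing a function that waits, up to a specified timeout, for a specific thing (an element that meets a given set of criteria) to show up on the page, and then returns its HtmlElement. It periodically checks the browser DOM, looking for that thing. It is meant to be called by a scraper worker running in a background thread; it uses an Invoke call to access the browser each time it checks.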
/// <summary>
/// Waits for a tag that matches the given criteria to show up on the page.
///
/// Note: This function returns a browser DOM element from the foreground thread, and this scraper is running in a background thread,
/// so use an invoke [ scraperForm.Browser.Invoke(new Action(()=>{ ... })); ] when doing anything with the returned DOM element.
/// </summary>
/// <param name="tagName">The type of tag, or "" if all tags are to be searched.</param>
/// <param name="id">The id of the tag, or "" if the search is not to be by id.</param>
/// <param name="className">The class name of the tag, or "" if the search is not to be by class name.</param>
/// <param name="keyContent">A string to search the tag's innerText for, or "" to skip the content check.</param>
/// <param name="timeout">The maximum time to wait, in milliseconds.</param>
/// <returns>The first tag to match the criteria, or null if such a tag was not found after the timeout period.</returns>
public HtmlElement WaitForTag(string tagName, string id, string className, string keyContent, int timeout) {
    Log(string.Format("WaitForTag('{0}','{1}','{2}','{3}',{4}) --", tagName, id, className, keyContent, timeout));
    HtmlElement result = null;
    int timeleft = timeout;
    while (timeleft > 0) {
        //Log("time left: " + timeleft);
        // Access the DOM in the foreground thread using an Invoke call.
        // (required by the WebBrowser control, otherwise cryptic errors result, like "invalid cast")
        scraperForm.Browser.Invoke(new Action(() => {
            HtmlDocument doc = scraperForm.CurrentDocument;
            if (doc == null) return; // no document loaded yet; try again on the next pass
            if (id == "") {
                //Log("no id supplied..");
                // no id was supplied, so get tags by tag name if a tag name was supplied, or get all the tags
                HtmlElementCollection elements = (tagName == "") ? doc.All : doc.GetElementsByTagName(tagName);
                //Log("found " + elements.Count + " '" + tagName + "' tags");
                // find the tag that matches the class name (if given) and contains the given content (if any)
                foreach (HtmlElement element in elements) {
                    if (element == null) continue;
                    if (className != "" && !TagHasClass(element, className)) {
                        //Log(string.Format("looking for className {0}, found {1}", className, element.GetAttribute("className")));
                        continue;
                    }
                    if (keyContent == "" ||
                        (element.InnerText != null && element.InnerText.Contains(keyContent)) ||
                        (tagName == "input" && element.GetAttribute("value").Contains(keyContent)) ||
                        (tagName == "img" && element.GetAttribute("src").Contains(keyContent)) ||
                        (element.OuterHtml != null && element.OuterHtml.Contains(keyContent)))
                    {
                        result = element;
                        break; // stop at the first match, as the <returns> doc promises
                    }
                    else if (keyContent != "") {
                        //Log(string.Format("searching for key content '{0}' - found '{1}'", keyContent, element.InnerText));
                    }
                }
            }
            else {
                //Log(string.Format("searching for tag by id '{0}'", id));
                // an id was supplied, so get the tag by id
                // Log("looking for element with id [" + id + "]");
                HtmlElement element = doc.GetElementById(id);
                // make sure it matches any given class name and contains any given content
                if (element != null
                    && (className == "" || TagHasClass(element, className))
                    && (keyContent == "" ||
                        (element.InnerText != null && element.InnerText.Contains(keyContent))))
                {
                    // Log(" found it");
                    result = element;
                }
                else {
                    // Log(" didn't find it");
                }
            }
        }));
        if (result != null) break; // the searched for tag appeared, break out of the loop
        Thread.Sleep(200); // wait 200 milliseconds and continue looping
        // Note: Make sure sleeps like this are outside of invokes to the foreground thread, so they only pause this background thread.
        timeleft -= 200;
    }
    return result;
}
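The function above calls a TagHasClass helper that isn't shown, and it is not obvious from the doc comment alone how a caller should handle the returned element. Below is a rough sketch of both, not the original author's code. The class name Scraper, the method ReadPriceText, and the "span" / "price" criteria are made up for illustration. The sketch assumes that scraperForm, Log and WaitForTag are members of the same class as in the answer, that the class is declared partial (otherwise paste the two methods in directly), and that TagHasClass does a whole-token match on the class attribute.

```csharp
using System;
using System.Windows.Forms;

public partial class Scraper   // hypothetical name for the class that holds WaitForTag
{
    // Hypothetical sketch of the TagHasClass helper: true when the element's
    // class attribute contains className as a whole, whitespace-separated token.
    private static bool TagHasClass(HtmlElement element, string className)
    {
        // The WebBrowser control exposes the class attribute as "className".
        string classes = element.GetAttribute("className") ?? "";
        string[] tokens = classes.Split(new[] { ' ', '\t', '\r', '\n' },
                                        StringSplitOptions.RemoveEmptyEntries);
        return Array.IndexOf(tokens, className) >= 0;
    }

    // Hypothetical worker step; call it from the background scraper thread.
    private string ReadPriceText()
    {
        // Wait up to 10 seconds for a <span class="price"> (made-up criteria).
        HtmlElement priceTag = WaitForTag("span", "", "price", "", 10000);
        if (priceTag == null)
        {
            Log("Timed out waiting for the price tag");
            return null;
        }

        // Touch the returned element only inside an Invoke on the UI thread.
        string text = null;
        scraperForm.Browser.Invoke(new Action(() => { text = priceTag.InnerText; }));
        return text;
    }
}
```

Copying the value out as a plain string inside the Invoke means the rest of the worker never has to touch the DOM element from the background thread.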