Question

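I'm working on a project that involves scraping some product data off a vendor's web site (with their blessing, but not their help). I work in a C# shop, so I'm using the .NET Windows Forms WebBrowser control.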

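I'm responding to the document completed event, but I'm finding that I have to sleep the thread for a bit, or else the data doesn't show up where I expect it to be in the DOM.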

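Looking at the javascript on the page, I can see that it dynamically alters the existing DOM content (setting someDomElement.innerHTML) after the page finishes loading. It isn't making any ajax calls; it uses data it already has from the original page load. (I could try to parse that data, but it's embedded in javascript and a bit obfuscated.) So evidently I'm somehow getting the document completed event before the javascript has finished running.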

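Eventually there may be many pages to scrape, so waiting around for half a second or so on each one is far from ideal. I'd like to wait only until all the javascript that is kicked off on document ready / page load has finished running before I examine the page. Does anyone know a way to do this?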

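I'd have thought the document completed event shouldn't fire until then, right? But it definitely seems to. Perhaps some of the page's javascript is using setTimeout. Is there a way to tell whether any timeouts are pending?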

Thanks for any help!

Answer 1

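You could:

- Assuming the format of the data never changes, look at how the javascript processes the data and do the same thing yourself as soon as the page loads
- Inject javascript into the page and detect DOM modifications, so you know when to grab the data from C#
- Use PhantomJS to write a pure javascript solution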
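
As a rough sketch of the second option, assuming the scraper can inject a script once the document completed event fires: the snippet below records an element's markup and lets the host ask later whether page script has rewritten it. `watchElement`, `hasChanged`, and `isStable` are made-up names for illustration, and the approach only works if the injection happens before the page's own script modifies the element.

```javascript
// Sketch only: watchElement and the methods it returns are hypothetical names.
// Records an element's markup at injection time so a caller can later ask
// whether page script has rewritten it (for example by setting innerHTML).
// Written in ES5 style because the WebBrowser control may run an old engine.
function watchElement(element) {
  var snapshot = element.innerHTML; // markup at injection time
  var lastSeen = snapshot;          // markup seen on the previous isStable() call

  return {
    // true once the markup differs from what it was at injection time
    hasChanged: function () {
      return element.innerHTML !== snapshot;
    },
    // true when the markup is identical to the previous poll, meaning
    // nothing modified it between two consecutive calls
    isStable: function () {
      var current = element.innerHTML;
      var stable = current === lastSeen;
      lastSeen = current;
      return stable;
    }
  };
}
```

The scraper could expose these through global wrapper functions and poll them with `HtmlDocument.InvokeScript`, reading the DOM once `hasChanged()` returns true and `isStable()` stays true across a couple of polls. This still polls, but each check is a cheap string comparison rather than a fixed sleep.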

Answer 2

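For posterity / anyone else looking at this: what I ended up doing was writing a function that waits, up to a specified timeout, for a particular thing (an element matching a given set of criteria) to show up on the page, and then returns its HtmlElement. It periodically checks the browser DOM, looking for that element. It is meant to be called by a scraper worker running in a background thread, and it uses an Invoke to access the browser each time it checks.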

    /// <summary>
    /// Waits for a tag that matches the given criteria to show up on the page.
    /// 
    /// Note: This function returns a browser DOM element from the foreground thread, and this scraper is running in a background thread,
    /// so use an invoke [ scraperForm.Browser.Invoke(new Action(()=>{ ... })); ] when doing anything with the returned DOM element.
    /// </summary>
    /// <param name="tagName">The type of tag, or "" if all tags are to be searched.</param>
    /// <param name="id">The id of the tag, or "" if the search is not to be by id.</param>
    /// <param name="className">The class name of the tag, or "" if the search is not to be by class name.</param>
    /// <param name="keyContent">A string to search the tag's innerText for.</param>
    /// <param name="timeout">The maximum time to wait, in milliseconds (checked in 200 ms steps).</param>
    /// <returns>The matching tag (the last match in document order when searching without an id), or null if no such tag was found within the timeout period.</returns>
    public HtmlElement WaitForTag(string tagName, string id, string className, string keyContent, int timeout) {
        Log(string.Format("WaitForTag('{0}','{1}','{2}','{3}',{4}) --", tagName, id, className, keyContent, timeout));
        HtmlElement result = null;
        int timeleft = timeout;
        while (timeleft > 0) {
            //Log("time left: " + timeleft);
            // Access the DOM in the foreground thread using an Invoke call.
            // (required by the WebBrowser control, otherwise cryptic errors result, like "invalid cast")
            scraperForm.Browser.Invoke(new Action(() => {
                HtmlDocument doc = scraperForm.CurrentDocument;
                if (id == "") {
                    //Log("no id supplied..");
                    // no id was supplied, so get tags by tag name if a tag name was supplied, or get all the tags
                    HtmlElementCollection elements = (tagName == "") ? doc.All : doc.GetElementsByTagName(tagName);
                    //Log("found " + elements.Count + " '" + tagName + "' tags");
                    // find the tag that matches the class name (if given) and contains the given content (if any)
                    foreach (HtmlElement element in elements) {
                        if (element == null) continue;
                        if (className != "" && !TagHasClass(element, className)) {
                            //Log(string.Format("looking for className {0}, found {1}", className, element.GetAttribute("className")));
                            continue;
                        }
                        if (keyContent == "" || 
                            (element.InnerText != null && element.InnerText.Contains(keyContent)) ||
                            (tagName == "input" && element.GetAttribute("value").Contains(keyContent)) ||
                            (tagName == "img" && element.GetAttribute("src").Contains(keyContent)) || 
                            (element.OuterHtml.Contains(keyContent)))
                        {
                            result = element;
                        }
                        else if (keyContent != "") {
                            //Log(string.Format("searching for key content '{0}' - found '{1}'", keyContent, element.InnerText));
                        }
                    }
                }
                else {
                    //Log(string.Format("searching for tag by id '{0}'", id));
                    // an id was supplied, so get the tag by id 
                    // Log("looking for element with id [" + id + "]");
                    HtmlElement element = doc.GetElementById(id);
                    // make sure it matches any given class name and contains any given content
                    if (
                        element != null 
                        && 
                        (className == "" || TagHasClass(element, className))
                        && 
                        (keyContent == "" || 
                            (element.InnerText != null && element.InnerText.Contains(keyContent))
                        )
                    ) {
                        // Log("  found it");
                        result = element;
                    }
                    else {
                        // Log("  didn't find it");
                    }
                }
            }));
            if (result != null) break;   // the tag we were looking for appeared, so stop polling
            Thread.Sleep(200);           // wait 200 ms before checking again
            // Note: Make sure sleeps like this are outside of invokes to the foreground thread, so they only pause this background thread.
            timeleft -= 200;
        }
        return result;
    }

C# WebBrowser - How can I wait for javascript to finish running that runs when the document has finished loading?
