HtmlUnit: text inputs not updated after a lengthy JavaScript bandwidth check

Date: 2018-07-29 17:32:21

Tags: javascript java htmlunit

For this page: https://www.check24.de/dsl/vergleich/ I am trying to implement a crawler with HtmlUnit version 2.31 that checks the bandwidth offered by different providers.

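If you fill in the address fields on the page manually, a popup shows the progress of the bandwidth check, and afterwards the same page shows the Internet bandwidth available at the requested address. The requested address is then displayed in labels (in place of the first text input fields).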

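When I try to do the same with my HtmlUnit crawler, I get the same page back (after a fairly long wait), but the input fields are not replaced by the labels inside the fieldset (id="tko-vcheck-done-wrapper") that show the address.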

Here is my code:

public Map<String, Integer> checkProviderBandWidthsByAddress(String zip, String city, String street, String hno){
    WebClient webClient = null;
    try{
        webClient = getWebCient();            
        HtmlPage page = webClient.getPage("https://www.check24.de/dsl/vergleich/");

        HtmlTextInput inputZipCity = (HtmlTextInput) page.getElementById("c24api_ac_widget_zipcity");
        HtmlHiddenInput inputZip = (HtmlHiddenInput) page.getElementById("c24api_ac_widget_zipcode");
        HtmlHiddenInput inputCity = (HtmlHiddenInput) page.getElementById("c24api_ac_widget_city");
        HtmlTextInput inputStreet = (HtmlTextInput) page.getElementById("c24api_ac_widget_street");
        HtmlTextInput inputStreetNumber = (HtmlTextInput) page.getElementById("c24api_ac_widget_streetnumber");
        HtmlButton buttonCheck = (HtmlButton) page.getElementById("tko-filter-vcheck-submit");

        inputZipCity.setValueAttribute(zip + " " + city);
        inputZipCity.fireEvent(Event.TYPE_INPUT);
        page.getWebClient().waitForBackgroundJavaScriptStartingBefore(1000);
        inputZip.setValueAttribute(zip);
        inputCity.setValueAttribute(city);
        inputStreet.setValueAttribute(street);
        inputStreetNumber.setValueAttribute(hno);

        page = buttonCheck.click();
        page.getWebClient().waitForBackgroundJavaScriptStartingBefore(30000);
        DomElement done = page.getElementById("tko-vcheck-done-wrapper"); // <-- Problem here: this is null

        List<DomElement> providers = page.getByXPath("//div[contains(@class, 'tko-result-row tko-clearfix')]");

        Map<String, Integer> bandWidths = findMaxSpeed(providers); // reading the download bandwidth of the general tariffs works fine, but this does not contain the address-specific bandwidth
        return bandWidths;
    }catch(Exception e){
        e.printStackTrace();
        return Collections.emptyMap();
    }finally {
        // guard against a NullPointerException if creating the client failed
        if (webClient != null) {
            webClient.close();
        }
    }
}

public static WebClient getWebCient(){
    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52); // also tried other browser versions
    webClient.setRefreshHandler(new WaitingRefreshHandler());
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setCssEnabled(false);
    webClient.setCssErrorHandler(new SilentCssErrorHandler());
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setUseInsecureSSL(true);
    webClient.getOptions().setRedirectEnabled(true);
    webClient.getCookieManager().setCookiesEnabled(true);
    webClient.getOptions().setPopupBlockerEnabled(false);
    return webClient;
}

I would be glad if anyone has an idea how to solve this.

1 Answer:

Answer 0: (score: 0)

Pages like this dreadful monster are a challenge for HtmlUnit. But with a bit of patience it works. (I am using HtmlUnit version 2.32.)

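I have added some comments to the sample code; I hope they help. Please treat the code as a proof of concept; I did not have enough time to write good code.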

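import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public static void main(String[] args) throws Exception {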
    String url = "https://www.check24.de/dsl/vergleich/";

    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
        HtmlPage page = webClient.getPage(url);

        // this page starts a lot of javascript
        // we have to wait until this is finished to get a page
        // that can respond to our typing
        wait(webClient, 60);

        HtmlTextInput inputZipCity = (HtmlTextInput) page.getElementById("c24api_ac_widget_zipcity");
        inputZipCity.type("50126");
        wait(webClient, 30);

        // System.out.println(page.getElementById("tko-result-filter-form-acsuggest").asXml());

        HtmlTextInput inputStreet = (HtmlTextInput) page.getElementById("c24api_ac_widget_street");
        HtmlTextInput inputStreetNumber = (HtmlTextInput) page.getElementById("c24api_ac_widget_streetnumber");

        inputStreet.type("Hauptstr.");
        wait(webClient, 10);

        inputStreetNumber.type("10");
        wait(webClient, 10);

        HtmlButton buttonCheck = (HtmlButton) page.getElementById("tko-filter-vcheck-submit");
        buttonCheck.click();
        wait(webClient, 4 * 60);

        HtmlPage refreshedPage = ((HtmlPage) page.getEnclosingWindow().getEnclosedPage());
        // System.out.println("----------------");
        // System.out.println(refreshedPage.asText());
        System.out.println(refreshedPage.getElementById("tko-result-sorting-text").getTextContent());
    }
}

private static void wait(WebClient webClient, int seconds) {
    long timeLimit = System.currentTimeMillis() + seconds * 1000L;
    int scriptCount = webClient.waitForBackgroundJavaScript(1000);
    while (scriptCount > 1 && timeLimit > System.currentTimeMillis()) {
        scriptCount = webClient.waitForBackgroundJavaScript(1000);
    }

    // seems like there is always one job in the queue (maybe some kind of heartbeat)
    if (scriptCount > 1) {
        System.out.println("Still some js is running " + scriptCount);
    }
}

At least this produces output like

68 Tarife verfügbar von 12,91 € bis 107,47 € (Durchschnitt pro Monat)

The same text is shown on the website when it is opened in a real browser.
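The wait helper above stops when the background job count drops, which says nothing about whether the element you care about has appeared. As a rough sketch (not part of the original answer), the same idea can be written as a generic "poll until a condition holds or a deadline passes" helper. The class name `PollUntil` and the method `pollUntil` are hypothetical names invented for this illustration, and the sketch deliberately uses only the JDK so that it runs without HtmlUnit.

```java
import java.util.function.BooleanSupplier;

public class PollUntil {

    /**
     * Evaluates the condition repeatedly until it returns true or the
     * timeout expires. Returns true if the condition was met in time,
     * false on timeout or if the thread was interrupted while sleeping.
     */
    public static boolean pollUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                // With HtmlUnit one would probably call
                // webClient.waitForBackgroundJavaScript(intervalMillis) here instead.
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 300 ms after start.
        boolean met = pollUntil(() -> System.currentTimeMillis() - start >= 300, 2000, 50);
        System.out.println("condition met: " + met);

        // A condition that never holds runs into the timeout.
        boolean never = pollUntil(() -> false, 200, 50);
        System.out.println("never-true condition met: " + never);
    }
}
```

In the crawler, the condition could presumably be something like `() -> ((HtmlPage) page.getEnclosingWindow().getEnclosedPage()).getElementById("tko-vcheck-done-wrapper") != null`, so that the loop ends as soon as the address labels are present. Whether that element actually appears under HtmlUnit on this page is an assumption that would need to be verified.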