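On this page: https://www.check24.de/dsl/vergleich/ I am trying to build a crawler with HtmlUnit 2.31 that checks the bandwidth offered by different providers.
If you fill in the address fields on the page by hand, a popup shows the progress of the bandwidth check, and afterwards the same page lists the Internet bandwidth available at the requested address. The requested address is then shown in a label (in place of the first text input field).
When I try to do the same with a crawler written with HtmlUnit, I do get the same page back (after a long wait), but the input fields are not replaced by the labels inside the fieldset (id="tko-vcheck-done-wrapper") that show the address.
Here is my code: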
public Map<String, Integer> checkProviderBandWidthsByAddress(String zip, String city, String street, String hno) {
    WebClient webClient = null;
    try {
        webClient = getWebCient();
        HtmlPage page = webClient.getPage("https://www.check24.de/dsl/vergleich/");
        HtmlTextInput inputZipCity = (HtmlTextInput) page.getElementById("c24api_ac_widget_zipcity");
        HtmlHiddenInput inputZip = (HtmlHiddenInput) page.getElementById("c24api_ac_widget_zipcode");
        HtmlHiddenInput inputCity = (HtmlHiddenInput) page.getElementById("c24api_ac_widget_city");
        HtmlTextInput inputStreet = (HtmlTextInput) page.getElementById("c24api_ac_widget_street");
        HtmlTextInput inputStreetNumber = (HtmlTextInput) page.getElementById("c24api_ac_widget_streetnumber");
        HtmlButton buttonCheck = (HtmlButton) page.getElementById("tko-filter-vcheck-submit");
        inputZipCity.setValueAttribute(zip + " " + city);
        inputZipCity.fireEvent(Event.TYPE_INPUT);
        page.getWebClient().waitForBackgroundJavaScriptStartingBefore(1000);
        inputZip.setValueAttribute(zip);
        inputCity.setValueAttribute(city);
        inputStreet.setValueAttribute(street);
        inputStreetNumber.setValueAttribute(hno);
        page = buttonCheck.click();
        page.getWebClient().waitForBackgroundJavaScriptStartingBefore(30000);
        DomElement done = page.getElementById("tko-vcheck-done-wrapper"); // <-- problem here: null
        List<DomElement> providers = page.getByXPath("//div[contains(@class, 'tko-result-row tko-clearfix')]");
        // works fine for reading the download bandwidth of the general tariffs,
        // but these do not contain the address-specific bandwidth
        Map<String, Integer> bandWidths = findMaxSpeed(providers);
        return bandWidths;
    } catch (Exception e) {
        e.printStackTrace();
        return Collections.emptyMap();
    } finally {
        // getWebCient() may have thrown before webClient was assigned
        if (webClient != null) {
            webClient.close();
        }
    }
}

public static WebClient getWebCient() {
    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52); // also tried with other browser versions
    webClient.setRefreshHandler(new WaitingRefreshHandler());
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setCssEnabled(false);
    webClient.setCssErrorHandler(new SilentCssErrorHandler());
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setUseInsecureSSL(true);
    webClient.getOptions().setRedirectEnabled(true);
    webClient.getCookieManager().setCookiesEnabled(true);
    webClient.getOptions().setPopupBlockerEnabled(false);
    return webClient;
}
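If anyone has an idea how to solve this, I would be very glad.
Answer 0 (score: 0)
Monster pages like this one are a challenge for HtmlUnit. But with a bit of patience it works. (I am using HtmlUnit version 2.32.)
I have added some comments to the sample code; hope that helps. And please treat the code as a proof of concept; I did not have enough time to write good code.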
public static void main(String[] args) throws Exception {
    String url = "https://www.check24.de/dsl/vergleich/";
    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
        HtmlPage page = webClient.getPage(url);
        // this page starts a lot of javascript;
        // we have to wait until this is finished to get a page
        // that can respond to our typing
        wait(webClient, 60);
        HtmlTextInput inputZipCity = (HtmlTextInput) page.getElementById("c24api_ac_widget_zipcity");
        inputZipCity.type("50126");
        wait(webClient, 30);
        // System.out.println(page.getElementById("tko-result-filter-form-acsuggest").asXml());
        HtmlTextInput inputStreet = (HtmlTextInput) page.getElementById("c24api_ac_widget_street");
        HtmlTextInput inputStreetNumber = (HtmlTextInput) page.getElementById("c24api_ac_widget_streetnumber");
        inputStreet.type("Hauptstr.");
        wait(webClient, 10);
        inputStreetNumber.type("10");
        wait(webClient, 10);
        HtmlButton buttonCheck = (HtmlButton) page.getElementById("tko-filter-vcheck-submit");
        buttonCheck.click();
        wait(webClient, 4 * 60);
        // re-read the page from the window, because the content may have been replaced
        HtmlPage refreshedPage = ((HtmlPage) page.getEnclosingWindow().getEnclosedPage());
        // System.out.println("----------------");
        // System.out.println(refreshedPage.asText());
        System.out.println(refreshedPage.getElementById("tko-result-sorting-text").getTextContent());
    }
}

private static void wait(WebClient webClient, int seconds) {
    long timeLimit = System.currentTimeMillis() + seconds * 1000L;
    int scriptCount = webClient.waitForBackgroundJavaScript(1000);
    while (scriptCount > 1 && timeLimit > System.currentTimeMillis()) {
        scriptCount = webClient.waitForBackgroundJavaScript(1000);
    }
    // seems like there is always one job in the queue (maybe some kind of heartbeat)
    if (scriptCount > 1) {
        System.out.println("Still some js is running " + scriptCount);
    }
}
At least this produces output like
68 Tarife verfügbar von 12,91 € bis 107,47 € (Durchschnitt pro Monat)
The same text is shown on the website when it is opened in a real browser.
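The `wait` helper above waits on the number of pending JavaScript jobs. An alternative that may fit the original question better is to poll for the thing you actually need, for example until the element with id `tko-vcheck-done-wrapper` exists. The following is only a minimal sketch of such a polling loop using plain Java; the class name `Poll` and the method `waitUntil` are made up for this illustration and are not part of HtmlUnit.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper (not part of HtmlUnit): polls a condition until a deadline.
final class Poll {
    private Poll() {
    }

    /**
     * Evaluates the condition repeatedly until it returns true or the timeout elapses.
     *
     * @return true if the condition became true in time, false on timeout or interruption
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // restore the interrupt flag and give up
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With HtmlUnit, the condition could be something like `() -> ((HtmlPage) page.getEnclosingWindow().getEnclosedPage()).getElementById("tko-vcheck-done-wrapper") != null`, and you would probably want to replace `Thread.sleep(intervalMillis)` with `webClient.waitForBackgroundJavaScript(intervalMillis)` so the page's scripts get a chance to run between checks. I have not tested this variant against the page itself.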