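I am using HtmlUnit to connect to a page and fetch some information. My problem is with the connection. On my local PC it works fine and can connect to, for example, link / leagues.asp. The page I'm scraping uses a cookie consent, and HtmlUnit should agree to it by clicking the button. As I said, this works locally and I get redirected to the page I want. On my server (VPS), however, I get an error: when I "click" the button to accept the cookies, I am redirected to index.htm, which does not exist on this site, and I get a 404 error.
I think it might be something to do with cookies? Here is my code: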
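// Imports assume HtmlUnit 2.x (com.gargoylesoftware packages)
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine;
import java.io.IOException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.logging.Level;

public HtmlPage connect(WebClient client, String... urlToConnect) {
    // Do not show JavaScript errors of the page
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
    System.setProperty("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
    // The page that this method will return
    HtmlPage page = null;
    // Pick the URL to use and clean it up, because some soccerstats URLs are broken
    String urlToUse = urlToConnect.length > 0 ? urlToConnect[0] : this.url;
    // Remove unnecessary link parts (replace() is a no-op when the marker is absent)
    urlToUse = urlToUse.replace("(a)", "");
    urlToUse = urlToUse.replace("(jor)", "");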
    try {
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setCssEnabled(false);
        client.getOptions().setThrowExceptionOnScriptError(false);
        client.getOptions().setThrowExceptionOnFailingStatusCode(true);
        client.getOptions().setDownloadImages(false);
        client.getOptions().setGeolocationEnabled(true);
        client.getOptions().setAppletEnabled(false);
        client.getOptions().setTimeout(2000);
        client.getOptions().setActiveXNative(false);
        client.getOptions().setPopupBlockerEnabled(false);
        client.getOptions().setRedirectEnabled(true);
        client.getOptions().setUseInsecureSSL(true);
        client.getCookieManager().setCookiesEnabled(true);
        client.getCache().setMaxSize(0);
        JavaScriptEngine js = new JavaScriptEngine(client);
        client.setJavaScriptEngine(js);
        WebRequest webRequest = new WebRequest(new URL(urlToUse.trim()));
        webRequest.setCharset(Charset.forName("UTF-8"));
        page = client.getPage(webRequest);
        // Wait for background JavaScript only after the page has been requested;
        // calling this before getPage() has nothing to wait for
        client.waitForBackgroundJavaScript(this.waitForJavascript);
        int status = page.getWebResponse().getStatusCode();
        // If the connection failed, wait for a while and try again
        if (status != 200) {
            System.out.println("[CONNECTION] Got StatusCode " + status + " for " + urlToUse);
            System.out.println("[ERROR] Could not connect to " + urlToUse + ". Retrying in " + this.retryInterval + "...");
            Thread.sleep(this.retryInterval);
            return this.connect(client, urlToUse); // recursive retry of the same URL
        } else {
            System.out.println("[CONNECTION] Got StatusCode " + status + " for " + urlToUse);
            // Check whether this is the cookie consent page
            if (page.getTitleText().equals("SoccerSTATS.com - cookie consent")) {
                System.out.println("[CONNECTION] Accepting cookies for " + urlToUse);
                HtmlButton btnCookie = (HtmlButton) page.getByXPath("//button[@class=\"button button3\"]").get(0);
                page = btnCookie.click();
            } else {
                this.connect(client, this.url);
            }
            System.out.println("\n[OK] Connected to: " + urlToUse);
        }
    } catch (IOException | InterruptedException e) {
        // Keep the page returned by the retry instead of discarding it
        page = this.connect(client, urlToUse);
        reporter.reportError("Could not establish connection", "Could not connect to: " + urlToUse, "Connection Error");
    } catch (FailingHttpStatusCodeException failingHttpStatusCodeException) {
        System.out.println("[ERROR] Could not connect! " + failingHttpStatusCodeException.getMessage() + " | URL: " + urlToUse);
        reporter.reportError("Could not establish connection", "Could not connect to: " + urlToUse, "Connection Error");
        try {
            Thread.sleep(this.retryInterval);
            page = this.connect(client, urlToUse);
        } catch (InterruptedException e) {
            page = this.connect(client, urlToUse);
            reporter.reportError("Could not establish connection", "Could not connect to: " + urlToUse, "Connection Error");
        }
    }
    return page;
}
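As an aside, the retries in the method above recurse with no upper limit, so a URL that keeps failing (such as the 404 on the VPS) is retried indefinitely. Below is a minimal sketch of a bounded retry loop. The `withRetries` helper and the `fetch` callback are hypothetical names that stand in for the `client.getPage(...)` call; neither is part of HtmlUnit. Note also that HtmlUnit's `FailingHttpStatusCodeException` is unchecked, so a real version would probably need to catch it as well.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class BoundedRetry {

    /**
     * Calls fetch up to maxAttempts times, sleeping retryIntervalMs between
     * failed attempts. Rethrows the last IOException if every attempt fails.
     * (Hypothetical helper, not an HtmlUnit API.)
     */
    static <T> T withRetries(int maxAttempts, long retryIntervalMs, Callable<T> fetch)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (IOException e) {
                lastError = e;
                System.out.println("[ERROR] Attempt " + attempt + " of " + maxAttempts
                        + " failed: " + e.getMessage());
                // Only sleep if another attempt will follow
                if (attempt < maxAttempts) {
                    Thread.sleep(retryIntervalMs);
                }
            }
        }
        throw lastError;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated fetch: fails twice, then succeeds
        String page = withRetries(3, 10, () -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated failure");
            }
            return "page content";
        });
        System.out.println("[OK] Got \"" + page + "\" after " + calls[0] + " attempts");
    }
}
```

In `connect`, the callback would wrap the page request, and the caller would decide what to do once all attempts are used up instead of recursing again.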
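The page I am trying to scrape is: www.soccerstats.com
Can anyone help me? Is there a way to simulate a real browser user? Or to use secure cookies, so that this consent page does not come up on any HTTP call?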