HtmlUnit: connecting to a page from a VPS

Time: 2018-08-05 00:10:41

Tags: java web-crawler htmlunit

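I am using HtmlUnit to connect to a page and extract some information. My problem is with the connection. On my local PC it works fine and I can connect to, for example, link / leagues.asp. The page I am scraping uses a cookie consent, which HtmlUnit should accept by clicking the button. As I said, this works locally and I am redirected to the page I want. On my server (VPS), however, I get an error: when I "click" the button to accept the cookies, I am redirected to index.htm, which does not exist on this site, and I get a 404 error.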

I think it might have something to do with cookies? Here is my code:

 public HtmlPage connect(WebClient client, String... urlToConnect) {
// do not show javascript errors of the page
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
System.setProperty("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");

// Instantiate the page that will be given to the connecting method
HtmlPage page = null;

// set the right link to use and format it if needed because some URLs of soccerstats are broken.
String urlToUse = urlToConnect.length > 0 ? urlToConnect[0] : this.url;

// remove unnecessary link parts
urlToUse = urlToUse.contains("(a)") ? urlToUse.replace("(a)", "") : urlToUse;
urlToUse = urlToUse.contains("(jor)") ? urlToUse.replace("(jor)", "") : urlToUse;

try {
    client.getOptions().setJavaScriptEnabled(true);
    client.getOptions().setCssEnabled(false);
    client.getOptions().setThrowExceptionOnScriptError(false);
    client.getOptions().setThrowExceptionOnFailingStatusCode(true);
    client.waitForBackgroundJavaScript(this.waitForJavascript);
    client.getOptions().setDownloadImages(false);
    client.getOptions().setGeolocationEnabled(true);
    client.getOptions().setAppletEnabled(false);
    client.getOptions().setTimeout(2000);
    client.getOptions().setActiveXNative(false);
    client.getOptions().setPopupBlockerEnabled(false);
    client.getOptions().setRedirectEnabled(true);
    client.getOptions().setUseInsecureSSL(true);
    client.getCookieManager().setCookiesEnabled(true);
    client.getCache().setMaxSize(0);

    JavaScriptEngine js = new JavaScriptEngine(client);
    client.setJavaScriptEngine(js);

    WebRequest webRequest = new WebRequest(new URL(urlToUse.trim()));
    webRequest.setCharset(Charset.forName("UTF-8"));

    page = client.getPage(webRequest);
    int status = page.getWebResponse()
            .getStatusCode();


    // if error connecting, wait for a time and try again
    if (status != 200) {
        System.out.println("[CONNECTION] Got StatusCode " + status + " for " + urlToUse);
        System.out.println("[ERROR] Could not connect to " + urlToUse + ". Retrying in " + this.retryInterval + "...");

        Thread.sleep(this.retryInterval);

        return this.connect(client, urlToUse); // retry recursively with the same URL
    } else {
        System.out.println("[CONNECTION] Got StatusCode " + status + " for " + urlToUse);

        // check if cookie consent page
        if (page.getTitleText().equals("SoccerSTATS.com - cookie consent")) {
            System.out.println("[CONNECTION] Accepting cookies for " + urlToUse);
            HtmlButton btnCookie = (HtmlButton) page.getByXPath("//button[@class=\"button button3\"]").get(0);
            page = btnCookie.click();
        } else {
            this.connect(client, this.url);
        }
        System.out.println("\n[OK] Connected to: " + urlToUse);
    }
} catch (IOException | InterruptedException e) {
    this.connect(client, urlToUse);
    reporter.reportError("Could not establish connection", "Could not connect to: " + urlToUse, "Connection Error");
} catch (FailingHttpStatusCodeException failingHttpStatusCodeException) {
    System.out.println("[ERROR] Could not connect! " + failingHttpStatusCodeException.getMessage() + " | URL: " + urlToUse);
    reporter.reportError("Could not establish connection", "Could not connect to: " + urlToUse, "Connection Error");
    try {
        Thread.sleep(this.retryInterval);
        this.connect(client, urlToUse);
    } catch (InterruptedException e) {
        this.connect(client, urlToUse);
        reporter.reportError("Could not establish connection", "Could not connect to: " + urlToUse, "Connection Error");
    }
}
return page;

}
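
The retry handling in the method above recurses with no upper bound, and in the catch blocks the result of the nested call is discarded, so the method can still return null. A minimal sketch of a bounded retry loop is shown here as one possible alternative. It uses only the JDK; `RetrySketch`, `retry`, and the simulated failure are hypothetical names for illustration and are not part of HtmlUnit. In the real method the lambda would wrap the `client.getPage(webRequest)` call.

```java
import java.util.concurrent.Callable;

/** Bounded retry helper; every name here is hypothetical and not part of HtmlUnit. */
final class RetrySketch {

    /**
     * Calls the action up to maxAttempts times, sleeping between failed attempts.
     * Returns the first successful result, or rethrows the last failure.
     */
    static <T> T retry(Callable<T> action, int maxAttempts, long waitMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry after an interrupt; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(waitMillis);
                }
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for client.getPage(...): fails twice, then succeeds.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated failure " + calls[0]);
            }
            return "page";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Because the loop returns the value of the successful call directly, the caller always gets either a result or an exception, never a silently dropped retry.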

The page I am trying to scrape is: www.soccerstats.com

Can anyone help me? Is there a way to simulate a real user's browser? Or can I use secure cookies to make sure that page does not show up on any HTTP call?
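
On the secure-cookie part of the question, one relevant property is that a cookie carrying the `Secure` attribute is only sent over HTTPS. The sketch below illustrates this with the JDK's own `java.net.CookieManager`, not with HtmlUnit's cookie manager, and makes no network calls. The cookie name `consent` and the class `SecureCookieSketch` are made up for the example; the real site may use different cookies.

```java
import java.net.CookieManager;
import java.net.URI;
import java.util.List;
import java.util.Map;

/** Shows that a Secure cookie is withheld from plain-http requests (JDK classes only). */
final class SecureCookieSketch {

    /** Builds a manager holding one hypothetical Secure consent cookie. */
    static CookieManager managerWithSecureCookie() throws Exception {
        CookieManager manager = new CookieManager();
        // Simulates a response header received from the site over https.
        manager.put(URI.create("https://www.soccerstats.com/"),
                Map.of("Set-Cookie", List.of("consent=yes; Path=/; Secure")));
        return manager;
    }

    /** Returns the Cookie header values the manager would attach to a request for the URI. */
    static List<String> cookiesFor(CookieManager manager, String uri) throws Exception {
        Map<String, List<String>> headers = manager.get(URI.create(uri), Map.of());
        return headers.getOrDefault("Cookie", List.of());
    }

    public static void main(String[] args) throws Exception {
        CookieManager manager = managerWithSecureCookie();
        System.out.println("http:  " + cookiesFor(manager, "http://www.soccerstats.com/leagues.asp"));
        System.out.println("https: " + cookiesFor(manager, "https://www.soccerstats.com/leagues.asp"));
    }
}
```

If the consent cookie on the real site were marked `Secure`, requesting the page over `http://` would not carry it, which could be one reason a consent page keeps reappearing. Whether that applies here is an assumption that would need checking against the actual response headers.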

0 answers:

No answers