HtmlUnit behaves differently on the server than locally

Time: 2018-08-04 14:12:05

Tags: java server web-crawler htmlunit

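I'm using HtmlUnit to connect to a page and extract some information. My problem is with the connection. On my local PC it works fine and connects to, for example, link / leagues.asp. The page I'm scraping uses a cookie consent, and HtmlUnit is supposed to accept it by clicking the button. As I said, this works locally, and I'm redirected to the page I want. On my server (a VPS), however, I get an error: when I "click" the button to accept the cookies, I'm redirected to index.htm, which does not exist on this site, and I get a 404 error.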

I think it might have something to do with cookies? Here is my code:

 public HtmlPage connect(WebClient client, String... urlToConnect) {
    // do not show javascript errors of the page
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
    System.setProperty("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");

    // Instantiate the page that will be returned by this method
    HtmlPage page = null;

    // set the right link to use and format it if needed because some URLs of soccerstats are broken.
    String urlToUse = urlToConnect.length > 0 ? urlToConnect[0] : this.url;

    // remove unnecessary link parts ("(a)" and "(jor)" markers)
    urlToUse = urlToUse.replace("(a)", "").replace("(jor)", "");

    try {
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setCssEnabled(false);
        client.getOptions().setThrowExceptionOnScriptError(false);
        client.getOptions().setThrowExceptionOnFailingStatusCode(true);
        client.waitForBackgroundJavaScript(this.waitForJavascript);
        client.getOptions().setDownloadImages(false);
        client.getOptions().setGeolocationEnabled(true);
        client.getOptions().setAppletEnabled(false);
        client.getOptions().setTimeout(2000);
        client.getOptions().setActiveXNative(false);
        client.getOptions().setPopupBlockerEnabled(false);
        client.getOptions().setRedirectEnabled(true);
        client.getOptions().setUseInsecureSSL(true);
        client.getCookieManager().setCookiesEnabled(true);
        client.getCache().setMaxSize(0);

        JavaScriptEngine js = new JavaScriptEngine(client);
        client.setJavaScriptEngine(js);

        WebRequest webRequest = new WebRequest(new URL(urlToUse.trim()));
        webRequest.setCharset(Charset.forName("UTF-8"));

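        page = client.getPage(webRequest);
        // give background JavaScript time to finish now that a page is loaded
        client.waitForBackgroundJavaScript(this.waitForJavascript);
        int status = page.getWebResponse().getStatusCode();
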
        // if error connecting, wait for a time and try again
        if (status != 200) {
            System.out.println("[CONNECTION] Got StatusCode " + status + " for " + urlToUse);
            System.out.println("[ERROR] Could not connect to " + urlToUse + ". Retrying in " + this.retryInterval + "...");

            Thread.sleep(this.retryInterval);

            return this.connect(client, urlToUse); // retry the same URL recursively
        } else {
            System.out.println("[CONNECTION] Got StatusCode " + status + " for " + urlToUse);

            // check if cookie consent page
            if (page.getTitleText().equals("SoccerSTATS.com - cookie consent")) {
                System.out.println("[CONNECTION] Accepting cookies for " + urlToUse);
                HtmlButton btnCookie = (HtmlButton) page.getByXPath("//button[@class=\"button button3\"]").get(0);
                page = btnCookie.click();
            } else {
                this.connect(client, this.url);
            }
            System.out.println("\n[OK] Connected to: " + urlToUse);
        }
    } catch (IOException e) {
        reporter.reportError("Could not establish connection", "Could not connect to: " + urlToUse, "Connection Error");
        page = this.connect(client, urlToUse); // keep the result of the retry
    } catch (InterruptedException e) {
        // restore the interrupt flag and stop retrying
        Thread.currentThread().interrupt();
        reporter.reportError("Could not establish connection", "Could not connect to: " + urlToUse, "Connection Error");
    } catch (FailingHttpStatusCodeException failingHttpStatusCodeException) {
        System.out.println("[ERROR] Could not connect! " + failingHttpStatusCodeException.getMessage() + " | URL: " + urlToUse);
        reporter.reportError("Could not establish connection", "Could not connect to: " + urlToUse, "Connection Error");
        try {
            Thread.sleep(this.retryInterval);
            page = this.connect(client, urlToUse); // keep the result of the retry
        } catch (InterruptedException e) {
            // restore the interrupt flag and stop retrying
            Thread.currentThread().interrupt();
        }
    }
    return page;
}
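
A side note on the retry logic in the method above: it retries by calling itself, so a URL that keeps failing (like the 404 on the VPS) would recurse without limit. Below is a minimal sketch of a bounded retry loop that could replace the recursion. The `BoundedRetry` class and its parameters are hypothetical names made up for this illustration and are not part of HtmlUnit; it uses only the standard library.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of HtmlUnit): retries a task a limited number of times. */
final class BoundedRetry {
    private BoundedRetry() {}

    /**
     * Runs the task until it succeeds or maxAttempts is reached,
     * sleeping pauseMillis between failed attempts.
     */
    static <T> T call(Callable<T> task, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                // do not keep retrying once the thread has been interrupted
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last; // every attempt failed: surface the last error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // simulated task: fails twice, then succeeds
        String result = BoundedRetry.call(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated failure");
            }
            return "connected";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Inside `connect`, the page load could then presumably be wrapped as `BoundedRetry.call(() -> client.getPage(urlToUse), 3, this.retryInterval)`, where the attempt count of 3 is an arbitrary choice.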

The page I'm trying to scrape is: www.soccerstats.com

Can anyone help me? Is there a way to simulate a real browser user? Or to use a secure cookie so that the consent page does not come up on any HTTP call?
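
To make the last two ideas concrete, here is a rough sketch of what "pre-seeding a consent cookie" and "sending browser-like headers" could look like. It deliberately uses the JDK's own `java.net` classes instead of HtmlUnit, so it only illustrates the idea. The cookie name `cookiesok` and its value `yes` are pure assumptions; the real name and value would have to be read from a browser's developer tools after accepting the banner. The User-Agent string is also just an example value, and whether the site would accept any of this is untested.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.Map;

/** Illustration only: the cookie name and value below are assumptions, not taken from the site. */
final class ConsentCookieSketch {
    static final String CONSENT_COOKIE_NAME = "cookiesok"; // hypothetical name
    static final String CONSENT_COOKIE_VALUE = "yes";      // hypothetical value

    private ConsentCookieSketch() {}

    /** Builds a cookie manager that already contains the (assumed) consent cookie for the site. */
    static CookieManager cookieManagerWithConsent(URI site) {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        HttpCookie consent = new HttpCookie(CONSENT_COOKIE_NAME, CONSENT_COOKIE_VALUE);
        consent.setPath("/");
        consent.setVersion(0); // plain "name=value" form, as browsers send it
        manager.getCookieStore().add(site, consent);
        return manager;
    }

    /** Builds a GET request carrying headers similar to what a desktop browser sends. */
    static HttpRequest browserLikeRequest(URI uri) {
        return HttpRequest.newBuilder(uri)
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                                + "(KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36")
                .header("Accept-Language", "en-US,en;q=0.9")
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        URI page = URI.create("https://www.soccerstats.com/leagues.asp");
        CookieManager cookies = cookieManagerWithConsent(page);
        // Show which Cookie header would accompany a request; no network call is made here.
        Map<String, List<String>> headers = cookies.get(page, Map.of());
        System.out.println("Cookie header: " + headers.get("Cookie"));
        System.out.println("Request headers: " + browserLikeRequest(page).headers().map());
    }
}
```

HtmlUnit has its own counterparts for this (`new WebClient(BrowserVersion.CHROME)`, `client.addRequestHeader(...)`, and `client.getCookieManager().addCookie(...)`), so the same idea could presumably be applied to the `WebClient` that is passed into `connect`.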

0 answers:

No answers