Java: 503 error when trying to read a web page with HTMLUnit

Time: 2016-07-17 01:20:35

Tags: java html htmlunit

I have been testing HTMLUnit, and I would like to know whether I can use it to get values from certain websites.

When I try https://rsbuddy.com/exchange?id12934, I get a 503 error.

It seems to be some kind of conflict with CloudFlare's IUAM (I'm Under Attack Mode).

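I looked around and found this site, where someone had the same problem as me. The community told the poster that HTMLUnit would solve their problem, but in the end it apparently did not, and no solution was posted.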

My code is currently very simple:

final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("https://rsbuddy.com/exchange?id12934");
System.out.println(page.asXml());

Output:

INFO:
<!DOCTYPE HTML>
<html lang="en-US">

<head>
  <meta charset="UTF-8" />
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
  <meta name="robots" content="noindex, nofollow" />
  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />
  <title>Just a moment...</title>
  <style type="text/css">
    html,
    body {
      width: 100%;
      height: 100%;
      margin: 0;
      padding: 0;
    }
    body {
      background-color: #ffffff;
      font-family: Helvetica, Arial, sans-serif;
      font-size: 100%;
    }
    h1 {
      font-size: 1.5em;
      color: #404040;
      text-align: center;
    }
    p {
      font-size: 1em;
      color: #404040;
      text-align: center;
      margin: 10px 0 0 0;
    }
    #spinner {
      margin: 0 auto 30px auto;
      display: block;
    }
    .attribution {
      margin-top: 20px;
    }
    @-webkit-keyframes bubbles {
      33%: {
        -webkit-transform: translateY(10px);
        transform: translateY(10px);
      }
      66% {
        -webkit-transform: translateY(-10px);
        transform: translateY(-10px);
      }
      100% {
        -webkit-transform: translateY(0);
        transform: translateY(0);
      }
    }
    @keyframes bubbles {
      33%: {
        -webkit-transform: translateY(10px);
        transform: translateY(10px);
      }
      66% {
        -webkit-transform: translateY(-10px);
        transform: translateY(-10px);
      }
      100% {
        -webkit-transform: translateY(0);
        transform: translateY(0);
      }
    }
    .bubbles {
      background-color: #404040;
      width: 15px;
      height: 15px;
      margin: 2px;
      border-radius: 100%;
      -webkit-animation: bubbles 0.6s 0.07s infinite ease-in-out;
      animation: bubbles 0.6s 0.07s infinite ease-in-out;
      -webkit-animation-fill-mode: both;
      animation-fill-mode: both;
      display: inline-block;
    }
  </style>

  <script type="text/javascript">
    //<![CDATA[
    (function() {
      var a = function() {
          try {
            return !!window.addEventListener
          } catch (e) {
            return !1
          }
        },
        b = function(b, c) {
          a() ? document.addEventListener("DOMContentLoaded", b, c) : document.attachEvent("onreadystatechange", b)
        };
      b(function() {
        var a = document.getElementById('cf-content');
        a.style.display = 'block';
        setTimeout(function() {
          var s, t, o, p, b, r, e, a, k, i, n, g, f, MASOuLk = {
            "eMSgRDgS": +((!+[] + !![] + !![] + !![] + []) + (!+[] + !![] + !![] + !![] + !![] + !![] + !![] + !![] + !![]))
          };
          t = document.createElement('div');
          t.innerHTML = "<a href='/'>x</a>";
          t = t.firstChild.href;
          r = t.match(/https?:\/\//)[0];
          t = t.substr(r.length);
          t = t.substr(0, t.length - 1);
          a = document.getElementById('jschl-answer');
          f = document.getElementById('challenge-form');;
          MASOuLk.eMSgRDgS -= +((!+[] + !![] + !![] + !![] + !![] + []) + (+[]));
          MASOuLk.eMSgRDgS -= +((!+[] + !![] + !![] + !![] + !![] + []) + (+[]));
          MASOuLk.eMSgRDgS += +((!+[] + !![] + []) + (!+[] + !![] + !![] + !![] + !![] + !![]));
          MASOuLk.eMSgRDgS *= +((!+[] + !![] + !![] + []) + (!+[] + !![] + !![] + !![] + !![] + !![] + !![]));
          MASOuLk.eMSgRDgS *= +((+!![] + []) + (!+[] + !![] + !![] + !![]));
          MASOuLk.eMSgRDgS *= +((!+[] + !![] + !![] + !![] + []) + (!+[] + !![] + !![] + !![]));
          MASOuLk.eMSgRDgS += +((!+[] + !![] + !![] + !![] + []) + (!+[] + !![] + !![] + !![]));
          a.value = parseInt(MASOuLk.eMSgRDgS, 10) + t.length;
          '; 121'
          f.submit();
        }, 4000);
      }, false);
    })();
     //]]>
  </script>


</head>

<body>
  <table width="100%" height="100%" cellpadding="20">
    <tr>
      <td align="center" valign="middle">
        <div class="cf-browser-verification cf-im-under-attack">
          <noscript>
            <h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
          </noscript>
          <div id="cf-content" style="display:none">
            <div>
              <div class="bubbles"></div>
              <div class="bubbles"></div>
              <div class="bubbles"></div>
            </div>
            <h1><span data-translate="checking_browser">Checking your browser before accessing</span> rsbuddy.com.</h1>
            <p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
            <p data-translate="allow_5_secs">Please allow up to 5 seconds&hellip;</p>
          </div>
          <form id="challenge-form" action="/cdn-cgi/l/chk_jschl" method="get">
            <input type="hidden" name="jschl_vc" value="c4f4252fa3ee7b54a685f74ba192d186" />
            <input type="hidden" name="pass" value="1468717381.249-GOgXzrnovV" />
            <input type="hidden" id="jschl-answer" name="jschl_answer" />
          </form>
        </div>


        <div class="attribution">
          <a href="https://www.cloudflare.com/5xx-error-landing?utm_source=iuam" target="_blank" style="font-size: 12px;">DDoS protection by CloudFlare</a>
          <br>Ray ID: 2c39c577c5bb41cf
        </div>
      </td>
    </tr>
  </table>
</body>

</html>

Exception in thread "main" com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 503 Service Temporarily Unavailable for https://rsbuddy.com/exchange?id12934
    at com.gargoylesoftware.htmlunit.WebClient.throwFailingHttpStatusCodeExceptionIfNecessary(WebClient.java:570)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:395)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:303)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:450)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:435)
    at TestMain.main(TestMain.java:20)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Is there a way to connect to this site using HTMLUnit?

1 answer:

Answer 0: (score: 1)

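The challenge page waits for a while as it checks the browser version. I believe it will work if you set the browser version first:

WebClient webClient = new WebClient(BrowserVersion.CHROME);

Then run this line to get the page: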

final HtmlPage page = webClient.getPage("https://rsbuddy.com/exchange?id12934");

After that, there are a couple of options:

I. Set a wait time:

webClient.waitForBackgroundJavaScript(5000);

while (page.asText().contains("Checking your browser before accessing")) {
    webClient.waitForBackgroundJavaScript(100);
}

II. Use Thread.sleep() instead of waiting for the JS:

Thread.sleep(2000); // use this in place of the waitForBackgroundJavaScript code

Finally, print it out:

System.out.println(page.asXml());
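
One caveat with the polling loop in option I is that it has no upper bound: if the challenge text never goes away, it spins forever. Below is a minimal sketch of a bounded version of the same idea. It uses only the JDK, so it runs without HTMLUnit. The names ChallengeWait, waitUntilGone and fakePage are hypothetical and not part of HTMLUnit, and fakePage is just a stand-in for the real page text.

```java
import java.util.function.Supplier;

public class ChallengeWait {

    /**
     * Polls pageText until it no longer contains marker, or until
     * timeoutMillis has elapsed. Returns true if the marker disappeared
     * in time, false on timeout or if the thread was interrupted.
     */
    public static boolean waitUntilGone(Supplier<String> pageText, String marker,
                                        long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (pageText.get().contains(marker)) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up instead of looping forever
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for the real page: returns the challenge text for the
        // first two reads, then something that looks like real content.
        final int[] reads = {0};
        Supplier<String> fakePage = () -> reads[0]++ < 2
                ? "Checking your browser before accessing rsbuddy.com."
                : "<html>real content</html>";

        boolean cleared = waitUntilGone(
                fakePage, "Checking your browser before accessing", 10_000, 100);
        System.out.println("challenge cleared: " + cleared);
    }
}
```

With HTMLUnit, the supplier could be something like `() -> ((HtmlPage) webClient.getCurrentWindow().getEnclosedPage()).asText()`, which re-reads the window's current page on every pass, since the original `page` reference may go stale once the challenge form submits. The challenge page itself is served with status 503, so `webClient.getOptions().setThrowExceptionOnFailingStatusCode(false)` would probably also be needed before calling `getPage`. I have not verified either against this particular site.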