Scraping a website with jsoup

Time: 2016-05-08 20:35:40

Tags: jsoup

How can I scrape this link: https://www.higheredjobs.com/details.cfm?JobCode=176261274&Title=Student%20Services%20Advisor

The site appears to have strong anti-scraping protection. I tried something like this:

Document doc = Jsoup.connect("https://www.higheredjobs.com/details.cfm?JobCode=176261274")
                .timeout(0)      // 0 means no timeout: wait indefinitely for the server
                .maxBodySize(0)  // 0 means no limit on the response body size
                .header("Accept-Encoding", "gzip")
                .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36")
                .referrer("https://www.higheredjobs.com/details.cfm?JobCode=176261274")
                .cookie("D_HID","/iHO5bR2qxwBL4Zf0KOB6W28KoU9Q4K9/Ou8c5S+V3o")
                .cookie("D_UID","E0FB6547-868C-38AC-BAF2-A752089887E0")
                .cookie("D_IID","915B5DF8-71F1-3991-82F0-ED68EBA81949")
                // ... all other cookies
                .get();

But the response is:

<!DOCTYPE html>
<html>
 <head> 
  <meta name="ROBOTS" content="NOINDEX, NOFOLLOW" /> 
  <meta http-equiv="cache-control" content="max-age=0" /> 
  <meta http-equiv="cache-control" content="no-cache" /> 
  <meta http-equiv="expires" content="0" /> 
  <meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" /> 
  <meta http-equiv="pragma" content="no-cache" /> 
  <meta http-equiv="refresh" content="10; url=/distil_r_blocked.html?Ref=/details.cfm?JobCode=176261274&amp;distil_RID=90819EF6-155B-11E6-97F5-80CDFCB63E7D&amp;distil_TID=20160508202929" /> 
  <script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script> 
  <script type="text/javascript" src="/ga100524.js" defer=""></script>
  <style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#qczbxvczrsvrrb{display:none!important}</style>
 </head> 
 <body> 
  <div id="distil_ident_block">
   &nbsp;
  </div>   
 </body>
</html>
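
The response above is not the job posting. Its only content is a meta refresh pointing at /distil_r_blocked.html and an empty distil_ident_block div, which suggests the request was classified as automated. A minimal sketch of how a scraper might recognise this kind of page before trying to parse it is shown below. The class name BlockPageCheck and its method names are hypothetical, the check is a heuristic based only on the two markers visible in this response, and the regular expression assumes the attribute order seen above (http-equiv before content).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: a heuristic check for the block page shown above.
class BlockPageCheck {
    // Matches <meta http-equiv="refresh" content="N; url=TARGET"> and captures TARGET.
    // Assumes http-equiv comes before content, as in the response above.
    private static final Pattern META_REFRESH = Pattern.compile(
        "<meta\\s+http-equiv=\"refresh\"\\s+content=\"\\d+;\\s*url=([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    /** Returns the meta-refresh target URL, or null if the page has none. */
    static String refreshTarget(String html) {
        Matcher m = META_REFRESH.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Returns true if the HTML carries either marker seen in the blocked response. */
    static boolean looksBlocked(String html) {
        String target = refreshTarget(html);
        boolean redirectsToBlockPage = target != null && target.contains("distil_r_blocked");
        boolean hasIdentBlock = html.contains("id=\"distil_ident_block\"");
        return redirectsToBlockPage || hasIdentBlock;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the blocked response above.
        String blocked = "<html><head><meta http-equiv=\"refresh\" "
            + "content=\"10; url=/distil_r_blocked.html?Ref=/details.cfm\" /></head>"
            + "<body><div id=\"distil_ident_block\">&nbsp;</div></body></html>";
        // Made-up example of an ordinary page.
        String normal = "<html><head><title>Job</title></head>"
            + "<body><h1>Student Services Advisor</h1></body></html>";

        System.out.println(looksBlocked(blocked)); // prints true
        System.out.println(looksBlocked(normal));  // prints false
    }
}
```

With jsoup, the raw HTML for such a check could be taken from Jsoup.connect(url).execute().body(). The check only detects the block page; it does not get past the protection.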

0 answers:

No answers