How can I scrape this page: https://www.higheredjobs.com/details.cfm?JobCode=176261274&Title=Student%20Services%20Advisor
The site appears to have strong anti-scraping protection. I tried something like this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("https://www.higheredjobs.com/details.cfm?JobCode=176261274")
        .timeout(0)      // 0 = no timeout, wait indefinitely for the server
        .maxBodySize(0)  // 0 = no limit on the response body size
        .header("Accept-Encoding", "gzip")
        .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36")
        .referrer("https://www.higheredjobs.com/details.cfm?JobCode=176261274")
        // Cookies copied from a browser session
        .cookie("D_HID", "/iHO5bR2qxwBL4Zf0KOB6W28KoU9Q4K9/Ou8c5S+V3o")
        .cookie("D_UID", "E0FB6547-868C-38AC-BAF2-A752089887E0")
        .cookie("D_IID", "915B5DF8-71F1-3991-82F0-ED68EBA81949")
        // ... all other cookies
        .get();
But the response is:
<!DOCTYPE html>
<html>
<head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_blocked.html?Ref=/details.cfm?JobCode=176261274&distil_RID=90819EF6-155B-11E6-97F5-80CDFCB63E7D&distil_TID=20160508202929" />
<script type="text/javascript">
(function(window){
try {
if (typeof sessionStorage !== 'undefined'){
sessionStorage.setItem('distil_referrer', document.referrer);
}
} catch (e){}
})(window);
</script>
<script type="text/javascript" src="/ga100524.js" defer=""></script>
<style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#qczbxvczrsvrrb{display:none!important}</style>
</head>
<body>
<div id="distil_ident_block">
</div>
</body>
</html>
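To tell this block page apart from a real job listing in code, the two markers visible in the response above (the meta refresh to `/distil_r_blocked.html` and the `distil_ident_block` div) can be checked for. This is only a minimal sketch using the standard library: the class name `DistilBlockCheck` and the method `isDistilBlockPage` are hypothetical, and it assumes the markers stay the same as in the dump above. It detects the block; it does not get around it.

```java
import java.util.regex.Pattern;

public class DistilBlockCheck {
    // Meta refresh whose target is the block page seen in the response above.
    private static final Pattern BLOCK_REFRESH = Pattern.compile(
            "<meta[^>]+http-equiv=\"refresh\"[^>]+distil_r_blocked\\.html",
            Pattern.CASE_INSENSITIVE);

    // The empty identification div seen in the response body.
    private static final Pattern IDENT_DIV = Pattern.compile(
            "id=\"distil_ident_block\"", Pattern.CASE_INSENSITIVE);

    // Returns true if the HTML contains either marker (assumed, not documented, markers).
    public static boolean isDistilBlockPage(String html) {
        if (html == null) {
            return false;
        }
        return BLOCK_REFRESH.matcher(html).find() || IDENT_DIV.matcher(html).find();
    }

    public static void main(String[] args) {
        String blocked = "<meta http-equiv=\"refresh\" "
                + "content=\"10; url=/distil_r_blocked.html?Ref=/details.cfm\" />";
        String normal = "<html><body><h1>Student Services Advisor</h1></body></html>";
        System.out.println(isDistilBlockPage(blocked));
        System.out.println(isDistilBlockPage(normal));
    }
}
```

With the jsoup snippet above, the check could be called as `DistilBlockCheck.isDistilBlockPage(doc.outerHtml())` before trying to parse any job details.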