Why does jsoup return different output than a web browser?

Date: 2015-02-19 06:58:00

Tags: java web-scraping web-crawler jsoup url-redirection

Here is my code:

    // In a browser this URL redirects to http://www.wardsdiscountcarpet.com/
    String url = "http://www.yellowbook.com/link/?listingId=1862997071&listingTypeId=2&sessionUID=893a4035-a985-4321-bb47-575fe281f266&visitorUID=9aad070b-3f42-4c4a-82d9-9fd92ca407d3&searchUID=a8f2e487-999f-47eb-bb46-4b945115f732&webRequestUID=101f1af8-956d-4723-bc26-b2df55155ce9&userId=marchex&siteId=40&website=1";

    // First request: collect the cookies the server sets
    Map<String, String> cookies = Jsoup.connect(url).followRedirects(true).userAgent("Chrome").execute().cookies();
    // Second request: send those cookies back and parse the response
    Document document = Jsoup.connect(url).followRedirects(true).userAgent("Chrome").cookies(cookies).get();
    System.out.println(document.html());

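When I don't set a user agent, the request fails with HTTP status code 416. When I use the code above, the request succeeds, but the output I get is different from what the browser shows.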

Output of the program:

<!DOCTYPE html>
<html>
 <head> 
  <meta name="ROBOTS" content="NOINDEX, NOFOLLOW"> 
  <meta http-equiv="cache-control" content="max-age=0"> 
  <meta http-equiv="cache-control" content="no-cache"> 
  <meta http-equiv="expires" content="0"> 
  <meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT"> 
  <meta http-equiv="pragma" content="no-cache"> 
  <meta http-equiv="refresh" content="10; url=/distil_r_blocked.html?Ref=/link/?listingId=1862997071&amp;listingTypeId=2&amp;sessionUID=893a4035-a985-4321-bb47-575fe281f266&amp;visitorUID=9aad070b-3f42-4c4a-82d9-9fd92ca407d3&amp;searchUID=a8f2e487-999f-47eb-bb46-4b945115f732&amp;webRequestUID=101f1af8-956d-4723-bc26-b2df55155ce9&amp;userId=marchex&amp;siteId=40&amp;website=1&amp;distil_RID=6296EE6C-B808-11E4-A524-C2F8C9906A54&amp;distil_TID=20150219072448"> 
  <script type="text/javascript" src="/ga.965296210610.js?PID=6D4E4D1D-7094-375D-A439-0568A6A70836" defer></script>
  <style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#hurriedea1b9a9d,#persuaded1ada68fe,#variable60a06b50,#persuaded1ada68fe{display:none!important}</style>
 </head> 
 <body> 
  <div id="distil_ident_block">
   &nbsp;
  </div> 
  <div id="d__fFH">
   <object id="d_dlg" classid="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></object>
   <span id="d__fF"></span>
  </div>  
 </body>
</html>
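
One detail in the output above is worth isolating: the page carries a `<meta http-equiv="refresh">` tag whose target is `/distil_r_blocked.html`, not the external site the browser ends up on. The following is a minimal sketch, using only the Java standard library, of how the `content` value of such a tag could be split into its redirect target. The class name `MetaRefresh` and the methods `targetOf` and `looksBlocked` are hypothetical names made up for this illustration, and the check for `distil_r_blocked` is only a heuristic based on this one response. With jsoup, the `content` string could presumably be obtained via `document.select("meta[http-equiv=refresh]").attr("content")`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaRefresh {
    // Matches "<seconds>; url=<target>", ignoring case and surrounding whitespace
    private static final Pattern REFRESH =
            Pattern.compile("^\\s*(\\d+)\\s*;\\s*url\\s*=\\s*(.+?)\\s*$", Pattern.CASE_INSENSITIVE);

    /** Returns the redirect target of a meta refresh content value, if it has one. */
    public static Optional<String> targetOf(String content) {
        if (content == null) {
            return Optional.empty();
        }
        Matcher m = REFRESH.matcher(content);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }

    /** Heuristic (assumption): the target name seen in the output above marks a block page. */
    public static boolean looksBlocked(String content) {
        return targetOf(content).map(t -> t.contains("distil_r_blocked")).orElse(false);
    }

    public static void main(String[] args) {
        // Shortened version of the content attribute from the output above
        String content = "10; url=/distil_r_blocked.html?Ref=/link/?listingId=1862997071";
        System.out.println(targetOf(content).orElse("(no refresh target)"));
        System.out.println(looksBlocked(content));
    }
}
```

If `looksBlocked` returns true for the fetched page, the HTML printed by the program is the site's interstitial page and not the listing redirect, which would explain why it differs from what the browser displays.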

0 answers:

No answers