I want to crawl a search engine, but the search query is percent-encoded

Time: 2018-02-26 06:44:15

Tags: java web-crawler jsoup search-engine

I am crawling a Chinese search engine. One of the search URLs is "https://xueshu.glgoo.net/scholar?hl=zh-CN&as_sdt=0%2C5&q=%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD&btnG=".

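The original page in the browser looks like this: Expected page. But when I fetch it with jsoup, the page I get looks like this: Obtained page with JSOUP. The crawled page appears to search for the percent-encoded query string rather than the search term itself. Can anyone help me solve this?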

Here is my code:

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Search URL; the q parameter carries the percent-encoded (UTF-8) query
    String url = "https://xueshu.glgoo.net/scholar?lr=&q=%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD&hl=zh-CN&as_sdt=0,5";
    Connection con = Jsoup.connect(url);
    // Browser-like request headers copied from a real session
    con.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
    // sdch and br dropped: jsoup handles gzip/deflate, and an undecoded br body would not parse as HTML
    con.header("Accept-Encoding", "gzip, deflate");
    con.header("Accept-Language", "en-US,en;q=0.8");
    con.header("Connection", "keep-alive");
    con.header("Cookie", "NID=122=makZ0dUnna6lDl4kG2AJrr8rj5MhO6kJgX_72-w1ORUGKKLaNim_mhdivMoIF4WVff7NDzBao6RRJNZUbUb0SttOsRnDEWgAh1KxoexAJ7uvO2AECnUJ_Hvx5klujBF4; GSP=LM=1519620995:S=gK3Jhz58HYNpLN1q; Hm_lvt_be2f71fe16256ff72aa6fdbdf058b9f3=1519620973; Hm_lpvt_be2f71fe16256ff72aa6fdbdf058b9f3=1519621078; xid=e97152b3b65b79d1aaded61fa2f71f09");
    con.header("Host", "xueshu.glgoo.net");
    con.header("Referer", "https://xue.glgoo.net/scholar?hl=zh-CN&q=%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD&btnG=&lr=");
    con.header("Upgrade-Insecure-Requests", "1");
    con.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36");
    // Execute the configured connection; Jsoup.connect(url).get() would open a new request without the headers above
    Document doc = con.get();
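
For reference, the `q` value in the URL above is the percent-encoded UTF-8 form of the search phrase 人工智能. The following is a minimal sketch, using only `java.net` from the standard library, that decodes that value and re-encodes it. The class name `QueryEncodingDemo` and its helper methods are hypothetical names chosen for illustration. The last line shows what a second round of encoding looks like (`%` becomes `%25`); whether double encoding is what happens in the jsoup request here is an assumption worth checking, not something the post confirms.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, not part of the original code.
class QueryEncodingDemo {

    // Percent-encode a raw search phrase as UTF-8 for use in a query string.
    static String encode(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Turn a percent-encoded UTF-8 query value back into the raw phrase.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The q value taken from the search URL in the question.
        String encoded = "%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD";

        String raw = decode(encoded);
        System.out.println(raw);              // the raw search phrase
        System.out.println(encode(raw));      // encoded once: same as the URL value
        System.out.println(encode(encoded));  // encoded twice: every % becomes %25
    }
}
```

If the fetched page shows the `%E4...` text in its search box, comparing the URL actually sent with the once-encoded and twice-encoded forms above may help narrow down where the extra encoding comes from.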

0 answers:

No answers