JSOUP HTTP error fetching URL. Status=403

Time: 2017-01-20 07:36:34

Tags: java parsing jsoup google-search google-search-api

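I am trying to search Google News for content within a specific date range, from January 1, 2016 to December 31, 2016.

At first this code worked. After I ran it a few times, it started failing with an HTTP error.

I don't know whether I set the user agent incorrectly or whether I have been blocked by Google.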

> Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://ipv4.google.com/sorry/index?continue=http://www.google.com/search%253Fq%253Dstackoverflow%2526tbm%253Dnws%2526tbs%253Dcdr%2525253A1%2525252Ccd_min%2525253A5%2525252F30%2525252F2016%2525252Ccd_max%2525253A6%2525252F30%2525252F2016%2526start%253D0&q=EgTKLTckGKH5hsQFIhkA8aeDS-3IYZmr41q-m4rIMh7Uw7vC3wdLMgNyY24
>     at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:679)
>     at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:676)
>     at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:628)
>     at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:260)
>     at org.jsoup.helper.HttpConnection.get(HttpConnection.java:249)
>     at javaapplication3.JavaApplication3.main(JavaApplication3.java:36)
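
The URL in the exception points at Google's `/sorry/` page rather than at the search itself. The original request is carried in its `continue` parameter, percent-encoded several times over. Below is a minimal sketch, using only the JDK, that decodes such a value repeatedly until it stops changing. The names `SorryUrlDecoder` and `decodeFully` are made up for illustration, and the `&q=...` token that follows `continue` in the trace is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class SorryUrlDecoder {

    // Percent-decodes the input repeatedly until a pass changes nothing.
    // Caveats: URLDecoder turns a literal '+' into a space and throws
    // IllegalArgumentException on a stray '%' that is not followed by hex digits.
    static String decodeFully(String value) throws UnsupportedEncodingException {
        String current = value;
        String decoded = URLDecoder.decode(current, "UTF-8");
        while (!decoded.equals(current)) {
            current = decoded;
            decoded = URLDecoder.decode(current, "UTF-8");
        }
        return current;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Value of the "continue" parameter copied from the exception message.
        String continueParam = "http://www.google.com/search%253Fq%253Dstackoverflow%2526tbm%253Dnws"
                + "%2526tbs%253Dcdr%2525253A1%2525252Ccd_min%2525253A5%2525252F30%2525252F2016"
                + "%2525252Ccd_max%2525253A6%2525252F30%2525252F2016%2526start%253D0";
        System.out.println(decodeFully(continueParam));
    }
}
```

For the value in the trace this prints `http://www.google.com/search?q=stackoverflow&tbm=nws&tbs=cdr:1,cd_min:5/30/2016,cd_max:6/30/2016&start=0`, which is the search request that was redirected to the `/sorry/` page.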

Here is the code:

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JavaApplication3 {
    public static void main(String[] args) throws UnsupportedEncodingException, IOException {
        String google = "http://www.google.com/search?q=";
        String search = "stackoverflow";
        String charset = "UTF-8";
        String news = "&tbm=nws";
        // tbs is the URL-encoded form of cdr:1,cd_min:1/1/2016,cd_max:12/31/2016
        String string = google + URLEncoder.encode(search, charset) + news
                + "&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2016%2Ccd_max%3A12%2F31%2F2016";
        String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
        int numberOfResultpages = 10; // number of search requests to make
        for (int i = 0; i < numberOfResultpages; i++) {
            Document document = Jsoup.connect(string).userAgent(userAgent).data("start", "" + i).get();
            Elements links = document.select(".r>a");
            for (Element link : links) {
                String title = link.text();
                String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
                url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
                if (!url.startsWith("http")) {
                    continue; // Ads/news/etc.
                }
                System.out.println("Title: " + title);
                System.out.println("URL: " + url);
            }
        }
    }
}
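
The hard-coded `tbs` value and the `start` parameter in the code above can also be generated rather than typed by hand. What follows is a sketch under stated assumptions, not a fix for the 403. `GoogleNewsUrls`, `buildNewsSearchUrl` and `extractTarget` are hypothetical names, the `M/d/yyyy` date format is inferred from the URL in the question, and stepping `start` by 10 per page is an assumption about the result page size that the question does not confirm. `extractTarget` looks up the `q` parameter by name, so it does not throw when a link contains no `&`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class GoogleNewsUrls {

    // Month/day/year without zero padding, as in the question's URL (1/1/2016).
    private static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("M/d/yyyy", Locale.ROOT);

    // Builds the news search URL for a date range and a zero-based page index.
    static String buildNewsSearchUrl(String query, LocalDate from, LocalDate to, int page)
            throws UnsupportedEncodingException {
        String tbs = "cdr:1,cd_min:" + DATE.format(from) + ",cd_max:" + DATE.format(to);
        return "http://www.google.com/search?q=" + URLEncoder.encode(query, "UTF-8")
                + "&tbm=nws"
                + "&tbs=" + URLEncoder.encode(tbs, "UTF-8")
                + "&start=" + (page * 10); // assumed: 10 results per page
    }

    // Returns the decoded value of the "q" parameter, or null if there is none.
    static String extractTarget(String href) throws UnsupportedEncodingException {
        int queryStart = href.indexOf('?');
        if (queryStart < 0) {
            return null;
        }
        for (String pair : href.substring(queryStart + 1).split("&")) {
            if (pair.startsWith("q=")) {
                return URLDecoder.decode(pair.substring(2), "UTF-8");
            }
        }
        return null;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(buildNewsSearchUrl("stackoverflow",
                LocalDate.of(2016, 1, 1), LocalDate.of(2016, 12, 31), 0));
        System.out.println(extractTarget(
                "http://www.google.com/url?q=http%3A%2F%2Fexample.com%2Fa&sa=U&ei=someKey"));
    }
}
```

With the dates from the question and page 0, the first call reproduces the query string used in the code above, followed by `&start=0`.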

0 answers:

No answers