Java - Jsoup: HTTP error fetching URL

Time: 2016-09-18 15:14:22

Tags: java jsoup

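I'm trying to use jsoup with Java to fetch Google News articles for a topic entered by the user. However, when I try to access the Google News page, I get a runtime error from this line: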

try {
    doc = (org.jsoup.nodes.Document) Jsoup.connect("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=" + "technology").get();
} catch (IOException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
}

When I run this code, I get the following error:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=technology
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:590)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:216)
at newsbot.NewsBot.onUpdateReceived(NewsBot.java:93)
at org.telegram.telegrambots.updatesreceivers.BotSession$HandlerThread.run(BotSession.java:197)

However, if I type the link into Google, the page I want to access loads perfectly. I would really appreciate your help, thanks.

2 answers:

Answer 0: (score: 0)

You need to include a user agent:

Jsoup.connect("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q="+ "technology")
     .userAgent("blah-blah")
     .get();

Answer 1: (score: 0)

You can include a user agent so that the request for the page is not rejected as forbidden (HTTP 403):

// Document here is org.jsoup.nodes.Document; get() already returns it, so no cast is needed
Document doc = Jsoup
        .connect("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=" + "technology")
        .ignoreContentType(true)
        .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
        .get();
System.out.println(doc);
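
Since the topic in the question comes from user input, it may contain spaces or characters such as & that would break the query string when concatenated directly. The following is a minimal sketch of one way to build the URL safely, assuming Java 10 or later for URLEncoder.encode(String, Charset). The class NewsUrlBuilder and the method buildSearchUrl are hypothetical names used for illustration; they are not part of jsoup or of the original code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of jsoup or the original question.
class NewsUrlBuilder {
    // Base URL copied from the question, without the topic appended.
    static final String BASE =
            "https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=";

    // Form-encodes the topic (spaces become '+', symbols become %XX)
    // so the whole topic stays inside the q parameter.
    static String buildSearchUrl(String topic) {
        return BASE + URLEncoder.encode(topic, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("technology"));
        System.out.println(buildSearchUrl("C++ & Java"));
    }
}
```

The resulting string can be passed to Jsoup.connect(...) together with .userAgent(...) as in the answers above. Encoding only fixes the URL; it does not by itself avoid the 403.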