Question

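I've run into a confusing problem. I have only been doing networking for about a day, so I apologize if I've made a silly mistake. My problem is that I can't programmatically access a URL that opens fine when I copy and paste it into Chrome.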

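I'm using a library called jsoup (http://jsoup.org/apidocs/), which parses text out of a website's raw HTML. My general goal is to take a base URL, append a string to it, and fetch the resulting web page. I'm using this code (edited for those who asked for more code; I know it's still sparse, but this is the only code that runs before the error)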

String url = "https://www.google.com/search?q=definition+of+";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get(); //url is the String in question

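to fetch the web page. My end goal is to use this approach to get the text of the box that appears at the top of a Google search in Chrome when you search for the definition of a word, i.e. the box at the top of: https://www.google.com/search?q=definition+of+apple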

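However, when I try to use the link above as my URL, I get an org.jsoup.HttpStatusException, so I assume this is a networking problem. Why does this URL work when typed into Chrome but not from Java? (I'm also open to a different way of getting the information in that box, since my current approach feels a bit roundabout.)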

Full error message (edited)

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.google.com/search?q=definition+of+apple
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:435)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at test.Test.parseDef(Test.java:68)
at test.Test.main(Test.java:112)

To anyone who answers, thank you for taking the time to help a networking newbie!

Answer 1

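Most likely, Google is correctly identifying your program as a "bot" and responding accordingly, which is why you get a 403. Google encourages bots to use the Google Custom Search API and discourages them from using the human-facing search interface.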
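As a rough illustration of the API route, here is a minimal sketch that only builds a request URL for the Custom Search JSON API. The endpoint and the `key`, `cx`, and `q` parameter names are given as I understand them from Google's documentation, so check them against the current docs. The class name `CustomSearchUrl` and the placeholder key and engine ID are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class CustomSearchUrl {
    // Assumed endpoint of the Custom Search JSON API; verify against Google's docs.
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /** Builds a query URL from an API key, a search engine ID (cx), and a search term. */
    public static String build(String apiKey, String engineId, String query) {
        return ENDPOINT + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query);
    }

    // Form-encodes one parameter value; UTF-8 is always available on the JVM.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "YOUR_API_KEY" and "YOUR_ENGINE_ID" are placeholders, not real credentials.
        System.out.println(build("YOUR_API_KEY", "YOUR_ENGINE_ID", "definition of apple"));
    }
}
```

The response from that API is JSON rather than HTML, so it would be read with a JSON parser instead of jsoup.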

In fact, all web spiders are supposed to check robots.txt, right? Here is Google's: http://www.google.com/robots.txt. Note that /search is disallowed.
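To make the robots.txt check concrete, here is a simplified sketch that reads the `Disallow` prefixes from the `User-agent: *` group and tests a path against them. It assumes plain prefix matching and ignores `Allow` rules, wildcards, and inline comments, so it is not a full robots.txt parser. The class name `RobotsCheck` and the two-line sample are hand-written for illustration and are not a copy of Google's real file.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    /** Collects Disallow prefixes that apply to all user agents ("User-agent: *"). */
    public static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            String line = rawLine.trim();
            int colon = line.indexOf(':');
            if (line.startsWith("#") || colon < 0) {
                continue; // skip comment lines, blank lines, and anything malformed
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    /** Returns true if the path starts with any disallowed prefix. */
    public static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hand-written sample mirroring the rule described above.
        String sample = "User-agent: *\nDisallow: /search\n";
        System.out.println(isDisallowed(sample, "/search?q=definition+of+apple"));
        System.out.println(isDisallowed(sample, "/maps"));
    }
}
```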

See this question for more information. It's basically the Python version of your question: Why does Google Search return HTTP Error 403?

Answer 2

If you use Jsoup, you must replace spaces with %20 rather than +.

Try this URL: https://www.google.com/search?q=definition%20of%20apple

String url = "https://www.google.com/search?q=definition%20of%20";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get();
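Since `search` is appended as-is, a term containing spaces or other special characters would still end up unencoded in the URL. One possible way to handle that is sketched below, assuming `java.net.URLEncoder` is acceptable here. It form-encodes the term (which turns spaces into `+`) and then swaps `+` for `%20` to match the suggestion above. The class name `SearchUrl` and method `buildUrl` are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrl {
    static final String BASE = "https://www.google.com/search?q=definition%20of%20";

    /** Percent-encodes the search term and appends it to the base URL. */
    public static String buildUrl(String search) {
        try {
            // URLEncoder writes a space as "+"; a literal "+" in the input becomes "%2B",
            // so replacing "+" afterwards only affects encoded spaces.
            String encoded = URLEncoder.encode(search, "UTF-8").replace("+", "%20");
            return BASE + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("ice cream"));
    }
}
```

The resulting string can then be passed to `Jsoup.connect(...)` in place of the manually concatenated `url`.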

Answer 3

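import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Example {
    public static void main(String[] args) throws IOException {
        String link = "http://example.com"; // placeholder; the original left `link` undefined
        Document doc = Jsoup.connect(link)
            .data("query", "Java")   // request parameter sent with the POST
            .userAgent("Mozilla")    // send a browser-like User-Agent header
            .cookie("auth", "token") // example cookie name and value
            .timeout(1000)           // timeout in milliseconds
            .post();                 // throws IOException on network or HTTP errors
    }
}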

Cannot access a URL that works in Google Chrome when using Google with Jsoup?
