Question

I am trying to scrape links from Google search using 600 different searches, and partway through I started getting the following error.

Error

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=http://ipv4.google.com/sorry/IndexRedirect?continue=http://google.com/search/...

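I have done my research: this happens because Google Scholar limits you to a certain number of searches and then requires a captcha to be solved before you can continue, which jsoup cannot do.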

Code

Document doc = Jsoup.connect("http://google.com/search?q=" + keyWord)
.userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
.timeout(5000)
.get();

The answers on the internet are very vague and don't provide a clear solution. Someone did mention that cookies can solve this problem, but did not say how.

Answer 1

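Some tips to improve your scraping:

1. Use proxies

Proxies let you reduce the chance of being caught by a captcha. You should use between 50 and 150 proxies, depending on your average result set. Here are two sites that provide proxies: SEO-proxies.com and Proxify Switch Proxy.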

import java.net.InetSocketAddress;
import java.net.Proxy;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Setup proxy
String proxyAddress = "1.2.3.4";
int proxyPort = 1234;
Proxy proxy = new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(proxyAddress, proxyPort));

// Fetch url with proxy (connect() comes first: it returns the Connection that the other calls configure)
Document doc = Jsoup //
               .connect(searchUrl) //
               .proxy(proxy) //
               .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2") //
               .header("Content-Language", "en-US") //
               .get();
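
With 50 to 150 proxies you need some way to spread requests across them. Below is a minimal sketch of a round-robin pool using only `java.net`. The class name `ProxyRotator` and the "host:port" input format are my own assumptions for illustration, not part of jsoup or of any proxy provider's API.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hands out proxies from a fixed pool in round-robin order (hypothetical helper). */
final class ProxyRotator {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger counter = new AtomicInteger();

    /** Each entry is "host:port", e.g. "1.2.3.4:1234" (placeholder values). */
    ProxyRotator(List<String> hostPorts) {
        if (hostPorts.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        for (String hostPort : hostPorts) {
            int colon = hostPort.lastIndexOf(':');
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            // createUnresolved skips the DNS lookup while the pool is being built
            proxies.add(new Proxy(Proxy.Type.HTTP,
                    InetSocketAddress.createUnresolved(host, port)));
        }
    }

    /** Returns the next proxy, wrapping around at the end of the pool. */
    Proxy next() {
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

Each search would then pass `rotator.next()` to the `.proxy(...)` call instead of a single fixed proxy.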

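2. Captcha

If you get caught by a captcha anyway, you can use an online captcha solving service (Bypass Captcha, DeathByCaptcha to name a few). Here is the general procedure for getting a captcha solved automatically:

Detect the captcha error page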

try {

    // Perform search here...

} catch (HttpStatusException e) {
    switch (e.getStatusCode()) {
        case java.net.HttpURLConnection.HTTP_UNAVAILABLE: // 503
            // Match only the fixed prefix of the "sorry" URL; the query string after it varies
            if (e.getUrl().contains("http://ipv4.google.com/sorry/IndexRedirect")) {
                // Ask online captcha service for help...
            } else {
                // ...
            }
            break;

        default:
            // ...
    }
}

Download the captcha image (CI)

byte[] captchaImage = Jsoup   //
.connect(imageCaptchaUrl)     // connect() comes first; it creates the Connection
//.cookie(..., ...)           // Some cookies may be needed...
.ignoreContentType(true)      // Needed for fetching image
.execute()                    //
.bodyAsBytes();               // byte[] array returned...

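Send the CI to the online captcha service

This part depends on the captcha service's API. You can find some services in this 8 best captcha solving services article.

Wait for the response (1-2 seconds is perfect)
Fill in the form with the response and submit it using Jsoup

The Jsoup FormElement is a lifesaver here. See this working sample code for details.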
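
For the waiting step, one possible approach is to poll the service a bounded number of times instead of blocking forever. This is only a sketch: `CaptchaPoller` and its `poll` parameter are hypothetical stand-ins for whatever "fetch result" call your chosen captcha service actually offers, so check its API before relying on this.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Polls for a captcha answer with a bounded number of attempts (hypothetical helper). */
final class CaptchaPoller {
    /**
     * Calls poll until it yields an answer or maxAttempts is reached.
     * poll is an assumption: it represents the service-specific result lookup.
     */
    static Optional<String> awaitAnswer(Supplier<Optional<String>> poll,
                                        int maxAttempts, long delayMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Optional<String> answer = poll.get();
            if (answer.isPresent()) {
                return answer;
            }
            if (attempt + 1 < maxAttempts) {
                try {
                    Thread.sleep(delayMillis); // pause before asking again
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return Optional.empty();
                }
            }
        }
        return Optional.empty(); // service never answered within the budget
    }
}
```

An empty result means you should give up on this captcha (or switch proxy) rather than submit the form.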

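3. Some other tips

The Hints for Google scrapers article can give you more pointers for improving your code. You will find the first two tips there, plus more:

Cookies: clear them on every IP change, or don't use them at all
Threads: do not open too many connections. Firefox limits itself to 4 connections per proxy.
Returned results: append &num=100 to your URL to send fewer requests
Request rate: make your requests look human. You should not send more than 500 requests per 24 hours per IP.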
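
To make the last two tips concrete, here is a rough sketch that builds a search URL with `num` appended and works out the gap between requests implied by a daily budget. The class and method names are my own, and the 50% jitter is an arbitrary assumption chosen so the requests are not evenly spaced.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Random;

/** URL building and request pacing helpers (hypothetical names). */
final class SearchPacing {
    /** URL-encodes the keyword and appends num so each request returns more results. */
    static String searchUrl(String keyword, int resultsPerPage) {
        try {
            return "http://google.com/search?q=" + URLEncoder.encode(keyword, "UTF-8")
                    + "&num=" + resultsPerPage;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Evenly spaced gap between requests that stays within the daily budget. */
    static long minDelayMillis(int requestsPerDay) {
        long millisPerDay = 24L * 60 * 60 * 1000;
        return millisPerDay / requestsPerDay;
    }

    /** Adds up to 50% random extra delay on top of the minimum gap. */
    static long jitteredDelayMillis(int requestsPerDay, Random random) {
        long base = minDelayMillis(requestsPerDay);
        return base + (long) (random.nextDouble() * base * 0.5);
    }
}
```

At 500 requests per 24 hours the minimum gap comes out to 172800 ms, a little under three minutes per IP, which is one more reason to spread 600 searches over several proxies.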

Answer 2

As an alternative to Stephan's answer, you can use this package to get Google search results without the proxy hassle. Code example:

Map<String, String> parameter = new HashMap<>();
parameter.put("q", "Coffee");
parameter.put("location", "Portland");
GoogleSearchResults serp = new GoogleSearchResults(parameter);

JsonObject data = serp.getJson();
JsonArray results = (JsonArray) data.get("organic_results");
JsonObject first_result = results.get(0).getAsJsonObject();
System.out.println("first coffee: " + first_result.get("title").getAsString());

Project Github
