JSoup HttpStatusException

Time: 2016-11-22 20:17:27

Tags: jsoup

I am trying to use JSoup to parse the HTML of the following URL:

http://brickseek.com/walmart-inventory-checker/

When I run the program, I get the exception below. I am using jsoup-1.10.1.jar.

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://brickseek.com/walmart-inventory-checker/
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:598)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:548)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:235)
    at Third.main(Third.java:22)

Here is the program:

import java.io.IOException;

import org.jsoup.Connection.Method;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Third {

    public static void main(String[] args)  throws IOException {

        String uniqueSku ="44656182";
        String zipCode ="75160";

        Response response = Jsoup.connect("http://brickseek.com/walmart-inventory-checker/")
                .data("store_type","3", "sku", uniqueSku , "zip" , String.valueOf(zipCode) , "sort" , "distance")
                .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2")
                .method(Method.POST)
                .timeout(0)
                .execute();

        String rawHTML = response.body();
        Document parsedDocument = Jsoup.parse(rawHTML);
        Element bodyElement = parsedDocument.body();
        Elements inStockTableElement = bodyElement.getElementsByTag("table");
    }
}

Any help is greatly appreciated.

2 answers:

Answer 0 (score: 3)

The server probably has some way of detecting that you are scraping the page with a bot. Try changing your HTTP headers to the following:

import org.jsoup.Connection;

public class Util {
    // Adds browser-like request headers to an existing jsoup Connection.
    // Usage: Util.mask(Jsoup.connect(url)).data(...).method(Method.POST).execute();
    public static Connection mask(Connection c) {
        return c.header("Host", "brickseek.com")
                .header("Connection", "keep-alive")
//              .header("Content-Length", ""+c.request().requestBody().length())
                .header("Cache-Control", "max-age=0")
                .header("Origin", "https://brickseek.com/")
                .header("Upgrade-Insecure-Requests", "1")
                .header("User-Agent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                .referrer("http://brickseek.com/walmart-inventory-checker/")
                // Note: if the server replies with Brotli ("br"), jsoup may not be able to
                // decode the body; drop "br" here if the response looks garbled.
                .header("Accept-Encoding", "gzip, deflate, br")
                .header("Accept-Language", "en-US,en;q=0.8");
    }
}

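These headers were copied exactly from the request headers sent by Google Chrome. Bots are usually detected by a different header order or by different capitalization of the header names. By replicating Chrome's headers exactly, you should be able to get past the check without being detected.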

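Some bot-detection algorithms count the number of requests per IP address and start blocking once a certain threshold is exceeded. That is why the original code may still work for some people.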
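
If per-IP request counting is what triggers the block, spacing requests out may help. Below is a minimal sketch of a throttle in plain Java. The class name `RequestThrottle` and the 2-second interval are assumptions made for illustration; the site's actual threshold is not known.

```java
/**
 * Hypothetical helper (not part of jsoup): enforces a minimum gap between
 * requests so a scraper is less likely to exceed a per-IP rate limit.
 */
class RequestThrottle {
    private final long minIntervalMillis;
    // Earliest time (epoch millis) at which the next request may start.
    // 0 means the first request is allowed immediately.
    private long nextAllowedMillis = 0L;

    RequestThrottle(long minIntervalMillis) {
        if (minIntervalMillis < 0) {
            throw new IllegalArgumentException("interval must be >= 0");
        }
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Reserves the next request slot for a caller arriving at nowMillis and
     * returns how many milliseconds that caller has to wait first.
     */
    synchronized long reserve(long nowMillis) {
        long wait = Math.max(0L, nextAllowedMillis - nowMillis);
        nextAllowedMillis = nowMillis + wait + minIntervalMillis;
        return wait;
    }

    /** Blocks the calling thread until its reserved slot arrives. */
    void await() throws InterruptedException {
        long wait = reserve(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed interval: at most one request every 2 seconds.
        RequestThrottle throttle = new RequestThrottle(2000);
        for (int i = 0; i < 3; i++) {
            throttle.await();
            // The Jsoup.connect(...).execute() call would go here.
            System.out.println("request " + i + " may be sent now");
        }
    }
}
```

Calling `throttle.await()` immediately before each `execute()` keeps consecutive requests at least the chosen interval apart. Whether that stays under the server's limit can only be found by experiment.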

Answer 1 (score: 2)

Just add ignoreHttpErrors(true) to your code.

        Response response = Jsoup.connect("http://brickseek.com/walmart-inventory-checker/")
                .data("store_type","3", "sku", uniqueSku , "zip" , String.valueOf(zipCode) , "sort" , "distance")
                .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2")
                .method(Method.POST)
                .timeout(0)
                .ignoreHttpErrors(true)
                .execute();
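
Note that ignoreHttpErrors(true) only stops jsoup from throwing. The response still carries the server's status code (403 here), and its body is most likely an error page rather than the inventory table. Checking `response.statusCode()` before parsing is therefore a good idea. The sketch below shows one way to classify the code. `StatusCheck` and its messages are hypothetical names made up for this example, and the class deliberately has no jsoup dependency so it can run on its own.

```java
/**
 * Hypothetical helper (not part of jsoup): decides whether a response is
 * worth parsing, based on the value returned by response.statusCode().
 */
class StatusCheck {

    /** Returns true only for 2xx status codes. */
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    /** Returns a short human-readable hint for the given status code. */
    static String describe(int statusCode) {
        if (isSuccess(statusCode)) {
            return "OK: the body can be parsed";
        }
        if (statusCode == 403) {
            return "Forbidden: the server refused the request; the body is probably an error page";
        }
        if (statusCode == 429) {
            return "Too Many Requests: slow down and retry later";
        }
        return "HTTP " + statusCode + ": do not expect the usual page content";
    }

    public static void main(String[] args) {
        // In the real program this value would come from response.statusCode().
        int status = 403;
        System.out.println(describe(status));
    }
}
```

In the program from the question, this could be used as `if (!StatusCheck.isSuccess(response.statusCode())) { ... }` right after `execute()`, so that the table lookup is skipped when the server has rejected the request.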

Thanks