I am trying to parse the HTML of the following URL with JSoup:
http://brickseek.com/walmart-inventory-checker/
When I run the program I get the exception below. I am using jsoup-1.10.1.jar.
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://brickseek.com/walmart-inventory-checker/
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:598)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:548)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:235)
at Third.main(Third.java:22)
Here is the program:
import java.io.IOException;
import org.jsoup.Connection.Method;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Third {
    public static void main(String[] args) throws IOException {
        String uniqueSku = "44656182";
        String zipCode = "75160";
        // POST the form fields to the inventory checker page.
        Response response = Jsoup.connect("http://brickseek.com/walmart-inventory-checker/")
                .data("store_type", "3", "sku", uniqueSku, "zip", String.valueOf(zipCode), "sort", "distance")
                .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2")
                .method(Method.POST)
                .timeout(0) // 0 disables the timeout
                .execute();
        // Parse the response body and collect every <table> element.
        String rawHTML = response.body();
        Document parsedDocument = Jsoup.parse(rawHTML);
        Element bodyElement = parsedDocument.body();
        Elements inStockTableElement = bodyElement.getElementsByTag("table");
    }
}
Any help is greatly appreciated.
Answer 0: (score: 3)
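The server probably has some way of detecting that you are scraping the page with a bot. Try changing your HTTP headers to the following, applied by passing the connection through the helper, e.g. Util.mask(Jsoup.connect(url)):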
import org.jsoup.Connection;

public class Util {
    // Adds browser-like request headers to an existing jsoup connection.
    public static Connection mask(Connection c) {
        return c.header("Host", "brickseek.com")
                .header("Connection", "keep-alive")
                // .header("Content-Length", ""+c.request().requestBody().length())
                .header("Cache-Control", "max-age=0")
                .header("Origin", "https://brickseek.com/")
                .header("Upgrade-Insecure-Requests", "1")
                .header("User-Agent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                .referrer("http://brickseek.com/walmart-inventory-checker/")
                // Note: jsoup may not decode Brotli; drop "br" if the body comes back unreadable.
                .header("Accept-Encoding", "gzip, deflate, br")
                .header("Accept-Language", "en-US,en;q=0.8");
    }
}
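These headers are copied exactly from what Google Chrome sends. Bots are usually detected by a different header order or different header capitalization, so by copying Chrome exactly you should be able to get through undetected.
Some bot-detection algorithms count the number of requests per IP and start blocking once a certain threshold is exceeded, which is why the original code still works for some people.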
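If per-IP request counting is what triggers the block, spacing requests out on the client side may help. Below is a minimal sketch of that idea using only the standard library. The class name RequestThrottle, its methods, and any interval you pick are illustrative assumptions; nothing here is part of jsoup, and the threshold the site actually uses is not known.

```java
/**
 * Minimal client-side throttle: enforces a minimum gap between requests.
 * Hypothetical helper for illustration; not part of jsoup.
 */
final class RequestThrottle {
    private final long minIntervalMillis;
    private boolean sentBefore = false;
    private long lastSentAtMillis;

    RequestThrottle(long minIntervalMillis) {
        if (minIntervalMillis < 0) {
            throw new IllegalArgumentException("interval must be >= 0");
        }
        this.minIntervalMillis = minIntervalMillis;
    }

    /** How long to wait (in ms) before a request may be sent at time nowMillis. */
    synchronized long delayFor(long nowMillis) {
        if (!sentBefore) {
            return 0L; // the first request never waits
        }
        long elapsed = nowMillis - lastSentAtMillis;
        return Math.max(0L, minIntervalMillis - elapsed);
    }

    /** Records that a request was sent at time nowMillis. */
    synchronized void markSent(long nowMillis) {
        sentBefore = true;
        lastSentAtMillis = nowMillis;
    }

    /** Blocks until the minimum gap has passed, then records the send time. */
    synchronized void acquire() throws InterruptedException {
        long wait = delayFor(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markSent(System.currentTimeMillis());
    }
}
```

Calling throttle.acquire() right before each execute() keeps consecutive requests at least the chosen interval apart. The interval has to be found by experiment, since the site does not publish its limits.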
Answer 1: (score: 2)
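Just add ignoreHttpErrors(true) to your code. This stops jsoup from throwing on the 403, although the response body is then whatever the server sent along with the error.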
Response response = Jsoup.connect("http://brickseek.com/walmart-inventory-checker/")
        .data("store_type", "3", "sku", uniqueSku, "zip", String.valueOf(zipCode), "sort", "distance")
        .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2")
        .method(Method.POST)
        .timeout(0)
        .ignoreHttpErrors(true)
        .execute();
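Since ignoreHttpErrors(true) only suppresses the exception, it is worth checking the status code before parsing the body for tables. Here is a small sketch of such a check. StatusCheck and its Outcome values are hypothetical names made up for illustration, and the grouping of codes is an assumption about how a scraper might want to react, not something jsoup defines.

```java
/**
 * Hypothetical helper that groups HTTP status codes by how a scraper might react.
 * Not part of jsoup; the grouping is an illustrative assumption.
 */
final class StatusCheck {
    enum Outcome { OK, BLOCKED, RATE_LIMITED, OTHER_ERROR }

    static Outcome classify(int status) {
        if (status >= 200 && status < 300) {
            return Outcome.OK;           // body should be the real page
        }
        if (status == 401 || status == 403) {
            return Outcome.BLOCKED;      // server refused the request
        }
        if (status == 429 || status == 503) {
            return Outcome.RATE_LIMITED; // slowing down and retrying may help
        }
        return Outcome.OTHER_ERROR;      // anything else: treat the body as an error page
    }
}
```

With the request above, StatusCheck.classify(response.statusCode()) would give BLOCKED for the 403 from the question, so looking for table elements only makes sense when the result is OK.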
Thanks