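I am implementing a web scraper, and I use the InetAddress
class to get the IP address for a domain name. I tried the domain en.wikipedia.org and got the IP 208.80.154.224.
Now I am trying to fetch the page /wiki/Cricket from that server with the jsoup parser,
but I get the following error: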
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://208.80.154.224/wiki/Cricket
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:459)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:434)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:181)
at OtherClasses.TestDownloadJSoup.main(TestDownloadJSoup.java:30)
Java Result: 1
My code for fetching the page is:
Connection con = Jsoup.connect("http://208.80.154.224/wiki/Cricket")
.userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
.timeout(1000*5)
.followRedirects(true)
.referrer("http://www.google.com");
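What should I do to fix this 404 error? Even when I enter this IP directly in a browser, I get a "domain not configured on this server" error.
Answer 0 (score: 1)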
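Some servers implement virtual hosting, which means that one server (one IP address) can serve multiple domain names and decides which site's page to serve based on its configuration, typically keyed on the Host header of the request. A request addressed to the bare IP carries no recognizable domain name, so the server answers with 404.
You should add the Host header to your request: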
System.setProperty("sun.net.http.allowRestrictedHeaders", "true"); // required: without it the JDK ignores an overridden Host header; set it before the first HTTP request
Connection con = Jsoup.connect("http://208.80.154.224/wiki/Cricket")
.userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
.timeout(1000*5)
.followRedirects(true)
.header("Host","en.wikipedia.org") // new entry here
.referrer("http://www.google.com");
For more information, see this answer.
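To make the virtual-hosting idea concrete, here is a minimal sketch of how a server might pick a site from the Host header. It is not how Wikipedia's servers are actually configured: the class name `VirtualHostDemo`, the method `statusFor`, and the site table are all hypothetical and exist only for illustration.

```java
import java.util.Locale;
import java.util.Map;

public class VirtualHostDemo {
    // Hypothetical table of host names configured on one IP address.
    private static final Map<String, String> SITES = Map.of(
            "en.wikipedia.org", "English Wikipedia",
            "de.wikipedia.org", "German Wikipedia");

    // Returns the HTTP status a name-based virtual host would plausibly
    // send for a given Host header value (assumption: unknown host -> 404).
    public static int statusFor(String hostHeader) {
        if (hostHeader == null) {
            return 404;
        }
        // Host names are case-insensitive and may carry an optional ":port".
        String host = hostHeader.toLowerCase(Locale.ROOT);
        int colon = host.indexOf(':');
        if (colon >= 0) {
            host = host.substring(0, colon);
        }
        return SITES.containsKey(host) ? 200 : 404;
    }

    public static void main(String[] args) {
        // Request sent to the bare IP: the Host header is the IP itself.
        System.out.println(statusFor("208.80.154.224"));
        // Same IP, but the Host header names a configured domain.
        System.out.println(statusFor("en.wikipedia.org"));
    }
}
```

With the bare IP as the Host value the lookup fails, which mirrors the 404 in the question; once the header names a configured domain, the lookup succeeds.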