Domain not configured on this server

Time: 2015-03-26 01:49:51

Tags: java http jsoup

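I am implementing a web crawler, and I use the InetAddress class to get the IP address for a domain name. I tried the domain en.wikipedia.org and got the IP 208.80.154.224. Now I am trying to fetch the page /wiki/Cricket from that server with the jsoup parser, but I get the following error: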

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://208.80.154.224/wiki/Cricket
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:459)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:434)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:181)
    at OtherClasses.TestDownloadJSoup.main(TestDownloadJSoup.java:30)
Java Result: 1

My code for fetching the page is:

Connection con = Jsoup.connect("http://208.80.154.224/wiki/Cricket")
                        .userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
                        .timeout(1000*5)
                        .followRedirects(true)
                        .referrer("http://www.google.com");

What should I do to resolve this 404 error? Even when I enter this IP in a browser, I get a "Domain not configured on this server" error.

1 answer:

Answer 0: (score: 1)

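Some servers implement virtual hosting, which means that one server (one IP address) can serve multiple domain names and decides which site's pages to serve based on its configuration. The site is usually selected by the Host header of the request; when you request the URL by IP address, the Host header carries the IP, which matches no configured domain, hence the 404.
You should add a Host header to your request: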

System.setProperty("sun.net.http.allowRestrictedHeaders", "true"); // this line is important to allow change in the Host header
Connection con = Jsoup.connect("http://208.80.154.224/wiki/Cricket")
                    .userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
                    .timeout(1000*5)
                    .followRedirects(true)
                    .header("Host","en.wikipedia.org") // new entry here
                    .referrer("http://www.google.com");

For more information, see this answer.
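
To see the effect of the Host header in isolation, here is a minimal sketch that does not touch Wikipedia or jsoup. It assumes a toy local server built with the JDK's com.sun.net.httpserver.HttpServer that "knows" only the domain en.wikipedia.org, and a raw-socket client so the Host header can be set freely. The class name VirtualHostDemo and the methods startServer and fetchStatus are made up for illustration; a real virtual-hosting setup is more involved.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class VirtualHostDemo {

    // Starts a toy server on a free local port. It serves a page only when the
    // Host header names the one domain it is "configured" for (an assumption
    // made for this sketch), and answers 404 otherwise.
    static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(
                new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), 0);
        server.createContext("/", exchange -> {
            String host = exchange.getRequestHeaders().getFirst("Host");
            boolean known = "en.wikipedia.org".equals(host);
            byte[] body = (known ? "article page" : "Domain not configured")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(known ? 200 : 404, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Sends a raw GET request with the given Host header to the local server
    // and returns the numeric status code from the response's status line.
    static int fetchStatus(int port, String hostHeader) throws IOException {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port)) {
            String request = "GET /wiki/Cricket HTTP/1.1\r\n"
                    + "Host: " + hostHeader + "\r\n"
                    + "Connection: close\r\n\r\n";
            OutputStream out = socket.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.US_ASCII));
            String statusLine = in.readLine(); // e.g. "HTTP/1.1 404 Not Found"
            return Integer.parseInt(statusLine.split(" ")[1]);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer();
        int port = server.getAddress().getPort();
        try {
            // Host is the IP and port, as when the URL is typed with an IP address.
            System.out.println(fetchStatus(port, "127.0.0.1:" + port)); // 404
            // Same IP and path, but the Host header names the configured domain.
            System.out.println(fetchStatus(port, "en.wikipedia.org"));  // 200
        } finally {
            server.stop(0);
        }
    }
}
```

Both requests go to the same address and path; only the Host header differs, which is the situation in the question.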