Question

I'm trying to scrape some video game data from Metacritic, and I keep getting a 404 error on this page:

http://www.metacritic.com/game/playstation-2/ico

The connect call is very basic:

Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36").timeout(0).get();

Of the hundreds of similar Metacritic video game pages I've tried connecting to, this is the only one that returns a 404 every time. Any idea why?

Answer 1

The server really is returning a 404:

$ curl -I http://www.metacritic.com/game/playstation-2/ico
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=UTF-8
Server: Apache
X-Varnish: 868026494
Date: Tue, 10 Sep 2013 15:26:21 GMT
Connection: keep-alive

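The fact that the server also sends back a normal-looking (non-404) page body doesn't matter to Jsoup; it only looks at the status code the server gives in the HTTP response headers.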

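Welcome to the craptastic "what's the point??" world of the internet. :) Amusingly, curl -I http://www.metacritic.com/game/playstation-2/SDKFJSDF returns an HTTP status of 200 OK, yet the content it serves is a "404" page. Did I mention that the internet is full of junk?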

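You can ignore these errors by calling ignoreHttpErrors(true) on the Connection.Request object. The same method is also available directly on the Connection, so it can be chained onto the call from the question: Jsoup.connect(url).ignoreHttpErrors(true).get().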

Answer 2

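I realize this is very late for your question, but I ran into this problem today and finally figured out that Metacritic messed up. It looks like they have an Apache configuration that serves a 404 whenever a *.ico file (or most image types) is requested. They probably have something like this: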

RewriteRule (js|ico|gif|jpg|png|css|xml)$ - [R=404,L,NC]

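They left out the period (an escaped \.) before those extensions. As a result, any URL that ends in one of those letter sequences gets a 404 status along with the page content, even when the letters are just part of a game's name. Proof: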

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/foojpg'
HTTP/1.1 404 Not Found

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/foojpgz'
HTTP/1.1 200 OK

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/fooxml'
HTTP/1.1 404 Not Found

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/foocss'
HTTP/1.1 404 Not Found

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/foojs'
HTTP/1.1 404 Not Found

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/fooico'
HTTP/1.1 404 Not Found

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/fooicoo'
HTTP/1.1 200 OK
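
The missing-dot theory is also easy to check offline. Here is a rough Java sketch (Java, since the question uses Jsoup) that runs the guessed pattern and a dotted variant against a few paths. The class name SuffixRuleDemo, the helper blocked, and the dotted pattern are my own illustration, the rule itself is still only a guess at Metacritic's config, and java.util.regex is standing in for Apache's regex engine.

```java
import java.util.regex.Pattern;

public class SuffixRuleDemo {
    // Guessed rule from this answer: no dot before the extension group.
    // CASE_INSENSITIVE mirrors the NC flag.
    static final Pattern NO_DOT =
        Pattern.compile("(js|ico|gif|jpg|png|css|xml)$", Pattern.CASE_INSENSITIVE);

    // Assumed intent: a literal dot must precede the extension.
    static final Pattern WITH_DOT =
        Pattern.compile("\\.(js|ico|gif|jpg|png|css|xml)$", Pattern.CASE_INSENSITIVE);

    // True if the rule would fire for this path (i.e. the request would get a 404).
    static boolean blocked(Pattern rule, String path) {
        return rule.matcher(path).find();
    }

    public static void main(String[] args) {
        String[] paths = {
            "game/playstation-2/ico", "game/pc/foojpg", "game/pc/foojpgz",
            "game/pc/fooicoo", "favicon.ico"
        };
        for (String p : paths) {
            System.out.printf("%-24s no-dot=%-5b with-dot=%b%n",
                p, blocked(NO_DOT, p), blocked(WITH_DOT, p));
        }
    }
}
```

With the dot in place, only real file requests such as favicon.ico would be caught, and a game page whose name happens to end in "ico" would get through.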

I found that kind of funny :) Anyway, mystery solved.

Why do I get a 404 error when connecting to a specific web page with Jsoup?
