Nutch inconsistently ignores redirects

Time: 2015-02-27 08:38:29

Tags: redirect web-crawler nutch

I'm having trouble crawling a very simple redirect case (Nutch 1.9 / OpenJDK 7). Below is a packet capture of the process.

Time       Source          Destination     Protocol  Info
12.988003  99.99.99.99     8.8.4.4         DNS       Standard query 0xc165 A bloomberg.com
13.032343  8.8.4.4         99.99.99.99     DNS       Standard query response 0xc165 A 69.191.212.191 A 69.191.251.238
13.124471  99.99.99.99     69.191.212.191  HTTP      GET /robots.txt HTTP/1.0
13.228846  69.191.212.191  99.99.99.99     HTTP      HTTP/1.1 301 Moved Permanently (text/html)
13.264230  99.99.99.99     8.8.4.4         DNS       Standard query 0x7089 A www.bloomberg.com
13.344767  8.8.4.4         99.99.99.99     DNS       Standard query response 0x7089 CNAME www.bloomberg.com.edgekey.net CNAME e4569.x.akamaiedge.net A 23.214.189.136
13.351030  99.99.99.99     23.214.189.136  HTTP      GET /robots.txt HTTP/1.0
13.359121  23.214.189.136  99.99.99.99     HTTP      HTTP/1.0 200 OK (text/plain)
13.448604  99.99.99.99     69.191.212.191  HTTP      GET / HTTP/1.0
13.537211  69.191.212.191  99.99.99.99     HTTP      HTTP/1.1 301 Moved Permanently (text/html)
13.640146  99.99.99.99     69.191.212.191  HTTP      GET / HTTP/1.0
13.738564  69.191.212.191  99.99.99.99     HTTP      HTTP/1.1 301 Moved Permanently (text/html)

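Nutch tries to fetch http://bloomberg.com, which replies with a 301 redirect to http://www.bloomberg.com. For robots.txt the redirect is handled correctly: the second request goes to www.bloomberg.com and gets a 200. For 'GET /', however, the fetcher keeps retrying the original hostname, which keeps replying with 301. The fetch fails no matter how large http.redirect.max is (I have tested with 10).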
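To make the expected behavior concrete, here is a minimal sketch in Java of a redirect-following loop with a hop limit. It is only an illustration, not Nutch's fetcher code: the class name RedirectSketch, the method follow, and the in-memory redirect table that stands in for real HTTP responses are all hypothetical.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; this is NOT how Nutch implements its fetcher.
public class RedirectSketch {

    /**
     * Follows redirects using a simulated table (URL -> Location header value).
     * Returns the final URL, or null if more than maxRedirects hops are needed.
     */
    public static String follow(Map<String, String> redirects, String start, int maxRedirects) {
        String current = start;
        for (int hops = 0; hops <= maxRedirects; hops++) {
            String location = redirects.get(current);
            if (location == null) {
                return current; // no redirect for this URL: treat it as the final page
            }
            // The next request must go to the redirect target, resolved against
            // the current URL, not to the original host again.
            current = URI.create(current).resolve(location).toString();
        }
        return null; // redirect limit exceeded
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://bloomberg.com/", "http://www.bloomberg.com/");
        System.out.println(follow(redirects, "http://bloomberg.com/", 10));
    }
}
```

With a limit of 10 and a single 301 from bloomberg.com to www.bloomberg.com, a loop like this should finish after one hop. The capture instead shows the second 'GET /' going to 69.191.212.191 again, as if the redirect target had never replaced the original URL.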

Nutch 1.9 is running on OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.12.04.1), OpenJDK Client VM (build 24.65-b04, mixed mode, sharing).

Is this a bug (can you confirm it?) or just a misconfiguration?

Thanks.

1 answer:

Answer 0 (score: 1)