Question

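I'm fairly new to Nutch. I'm crawling a URL that redirects to another URL. When I analyze my crawl results, I get the content of the first URL along with the status code: temp redirect to (second URL name). My question is: why don't I get the content and details of the second URL? Is the redirected URL being crawled at all? Please help.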

Answer 1

Once again, the almighty nutch-default.xml has a property that controls how Nutch handles redirects.

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

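As the description says, with a value of 0 the fetcher won't immediately follow redirected URLs; it records them for later fetching instead. I still haven't figured out how to force fetching of the URLs in db_redir_temp. However, if you change this setting before you start the crawl, I think your problem may go away.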
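The configuration change mentioned above could look roughly like the following override in conf/nutch-site.xml, which takes precedence over nutch-default.xml. This is only a minimal sketch: the limit of 3 is an arbitrary illustrative value, not something from the original answer, and the description text is my own wording.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the default of 0 from nutch-default.xml.
       The value 3 is an assumed example; pick a limit that suits your crawl. -->
  <property>
    <name>http.redirect.max</name>
    <value>3</value>
    <description>Follow up to 3 redirects immediately while fetching a page,
    instead of recording the redirect target for later fetching.
    </description>
  </property>
</configuration>
```

With a positive value, the fetcher should follow the redirect during the same fetch, so the content of the second URL is stored rather than only the redirect status of the first.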

Answer 2

In Nutch 2.3.1, I set the following property in my nutch-site.xml, and it let me fetch the redirected URL on the next attempt. This may help others who are using Nutch 2.3.1.

<property>
  <name>db.fetch.interval.default</name>
  <value>0</value>
  <description>The default number of seconds between re-fetches of a page.
  The stock default is 2592000 (30 days); 0 makes a page eligible for
  re-fetching right away.
  </description>
</property>

Answer 3

In Nutch 2.3.1, the class

org.apache.nutch.protocol.http.api.HttpBase

has a method named getProtocolOutput.

Inside this method there is a call to another method:

Response response = getResponse(u, page, false); (Line 250)

Change false to true in the call above, since this flag corresponds to followRedirects.

Then recompile the Nutch classes, and following redirects will work properly :)

Nutch redirect handling problem
