Nutch crawl takes a very long time

Time: 2015-05-13 15:35:41

Tags: apache web-crawler nutch

I just want Nutch to give me a list of the URLs it crawled along with the status of each link. I don't need the full page content or any other fluff. Is there a way to do this? Crawling and parsing a seed list of 991 URLs at depth 3 takes more than 3 hours. I'd like to speed this up.

The nutch-default.xml file contains:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the file
   protocol, in bytes. If this value is nonnegative (>=0), content longer
   than it will be truncated; otherwise, no truncation at all. Do not
   confuse this setting with the http.content.limit setting.
  </description>
</property>

<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content will be saved during fetch.
  And it is probably what we want to set most of time, since file:// URLs
  are meant to be local and we can always use them directly at parsing
  and indexing stages. Otherwise file contents will be saved.
  !! NO IMPLEMENTED YET !!
  </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>65536</value> 
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>
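These limits can be overridden in conf/nutch-site.xml, which takes precedence over nutch-default.xml. The fragment below is only an illustrative sketch of lowering the HTTP truncation limit; the 16384-byte value is an arbitrary example, not a recommended setting:

```xml
<!-- conf/nutch-site.xml: overrides nutch-default.xml (value is illustrative) -->
<property>
  <name>http.content.limit</name>
  <!-- truncate fetched pages after 16 KB instead of the default 64 KB -->
  <value>16384</value>
</property>
```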

I think these properties may have something to do with it, but I'm not sure. Can someone give me some help and clarification? Also, I'm getting many URLs back with a status code of 38. I can't find what that status code indicates in this documentation. Thanks for the help!

1 Answer:

Answer 0 (score: 1)

Nutch performs parsing after fetching a URL in order to extract all the outlinks from the fetched page. The outlinks from each URL are used as the new fetchlist for the next round.

If you skip parsing, no new URLs are generated, so nothing further is fetched. One approach I can think of is to configure the parse plugins to handle only the content types you need (in your case, their outlinks). There is an example here - https://wiki.apache.org/nutch/IndexMetatags
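As a sketch of that idea: the plugin.includes property in conf/nutch-site.xml controls which plugins run. Restricting the parse plugins to parse-html keeps outlink extraction while skipping heavier parsers such as parse-tika. The exact plugin list below is illustrative and depends on your Nutch version and setup:

```xml
<!-- conf/nutch-site.xml (illustrative; adjust to the plugins you actually need) -->
<property>
  <name>plugin.includes</name>
  <!-- keep only what is needed to fetch pages and extract outlinks -->
  <value>protocol-http|urlfilter-regex|parse-html|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```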

This link describes the features of the parser: https://wiki.apache.org/nutch/Features

Now, to get only the list of fetched URLs along with their status, you can use the $ bin/nutch readdb crawldb -stats command.
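Note that -stats prints aggregate counts rather than per-URL rows; readdb also has a -dump option that writes one record per URL, including its status. The exact record layout varies between Nutch versions, so the sample below is a hypothetical sketch of reducing such a dump to URL/status pairs:

```shell
# Hypothetical sample of 'bin/nutch readdb crawldb -dump <dir>' output;
# the actual record layout may differ in your Nutch version.
cat > dump.txt <<'EOF'
http://example.com/	Version: 7
Status: 2 (db_fetched)
Fetch time: Wed May 13 15:35:41 EDT 2015
http://example.com/about	Version: 7
Status: 1 (db_unfetched)
EOF

# Pair each URL line with the Status line that follows it
awk '/^http/ {url=$1} /^Status:/ {print url, $2, $3}' dump.txt
# prints:
#   http://example.com/ 2 (db_fetched)
#   http://example.com/about 1 (db_unfetched)
```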

Regarding the status code 38: looking at the documentation you linked, it appears to be the URL status public static final byte STATUS_FETCH_NOTMODIFIED = 0x26, since hex 0x26 corresponds to decimal 38.
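The hex-to-decimal correspondence is easy to verify from a shell:

```shell
# convert the hex constant 0x26 to decimal
printf '%d\n' 0x26   # prints 38
```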

Hope the answer gives you some direction :)