Question

I don't want to crawl over HTTP, such as http://localhost:81, but to crawl a directory on the local file system directly.

Is there a way to do this?

Answer 1

From the Nutch Wiki:

How do I index my local file system?

http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6

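1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones. Otherwise it either won't index anything, or it will jump off your disk onto web sites. Change this line: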

  -^(file|ftp|mailto|https):

  to this:

  -^(http|ftp|mailto|https):

2) crawl-urlfilter.txt may have rules at the bottom that reject some URLs. If it has this fragment, it's probably fine:

  # accept anything else
  +.*

3) I changed my nutch.xml to include the following:

<Parameter override="false" name="plugin.includes" value="protocol-file|protocol-http|urlfilter-regex|parse-(msword|pdf|text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"/>
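The effect of the filter change in steps 1 and 2 can be checked with a small Python sketch. It is not Nutch code. It only imitates the rule format under the assumption that rules are read top to bottom, that the first pattern found in a URL decides its fate, and that a URL matching no rule is dropped. The helper names `parse_rules` and `is_accepted` and the sample URLs are made up for illustration.

```python
import re

# Rules as they would read after the edit described in steps 1 and 2.
# Assumption: the filter checks rules top to bottom and the first
# pattern found in the URL decides ('+' accepts, '-' rejects).
RULES_AFTER_EDIT = """
# skip http:, ftp:, mailto: and https: URLs
-^(http|ftp|mailto|https):
# accept anything else
+.*
"""


def parse_rules(text):
    """Turn filter text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_accepted(url, rules):
    """Return True if the first matching rule accepts the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: the URL is dropped


rules = parse_rules(RULES_AFTER_EDIT)
# Hypothetical sample URLs: one local file, one web page.
for url in ["file:///home/user/docs/report.pdf", "http://localhost:81/index.html"]:
    print(url, "->", "crawl" if is_accepted(url, rules) else "skip")
```

With the edited rule the file: URL falls through to the final `+.*` line and is accepted, while the http: URL is rejected by the first rule, which is what keeps the crawl from leaving the disk.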

Answer 2

Nutch can be used for intranet crawling. You can read the details here.

How to make Nutch crawl the file system?
