Question

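I'm trying to build a search tool, hosted on a CentOS 7 machine, that should index and search the directories of mounted NFS exports. Nutch + Solr looks like the best fit for this. I'm having a hard time configuring the URLs, because this will not crawl any http locations.

The mounts are under /mnt.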

So my seeds.txt looks like this:

[root@sauron bin]# cat /root/Desktop/apache-nutch-1.13/urls/seed.txt
file:///mnt

My regex-urlfilter.txt accepts the same location and allows the file protocol:

# skip file: ftp: and mailto: urls
-^(http|https|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^file:///mnt

However, when I try to bootstrap from the initial seed list, nothing gets injected:

[root@sauron apache-nutch-1.13]# bin/nutch inject crawl/crawldb urls
Injector: starting at 2017-06-12 00:07:49
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 1
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: finished at 2017-06-12 00:10:27, elapsed: 00:02:38

I have also tried changing seeds.txt to the following, with no luck:

MasterServer

Please let me know if I'm doing something wrong here.

Answer 1

From a URI perspective, the file system is not that different for Nutch. You just need to enable the protocol-file plugin and configure regex-urlfilter.txt like this:

+^file:///mnt/directory/
-.

With these two rules, you prevent Nutch from indexing the parent directories of the directory you specified.
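To see why those two rules keep the crawl inside one directory, here is a minimal Python sketch of first-match-wins filtering, which is how regex-urlfilter.txt rules are applied as far as I understand it: rules are tried top to bottom, the first pattern found in the URL decides, and a URL that matches nothing is dropped. The helper names `load_rules` and `accepts` and the sample URLs are made up for illustration; this is a simplified model, not Nutch's own code.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: a URL that matches no rule is rejected

# The two rules from the answer above.
ANSWER_RULES = """
+^file:///mnt/directory/
-.
"""

rules = load_rules(ANSWER_RULES)
for url in ("file:///mnt/directory/report.txt",  # hypothetical file inside the directory
            "file:///mnt/",                      # parent directory
            "file:///etc/passwd"):               # unrelated path
    print(url, "->", "accept" if accepts(rules, url) else "reject")
```

In this model only URLs that start with file:///mnt/directory/ survive; note that a seed written without the trailing slash (file:///mnt/directory) would fall through to the `-.` rule as well.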

Keep in mind that since the NFS share is already mounted locally, it behaves like a regular local file system. More information can be found at https://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F.
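Since the mount is just a local path, a seed URL can be derived from it the same way as for any other directory. A small sketch, assuming a POSIX host and using /mnt/directory as a placeholder path (the function name `seed_url` is hypothetical):

```python
from pathlib import Path

def seed_url(path):
    """Turn an absolute POSIX path into a file: URL ending in a slash."""
    # as_uri() percent-encodes special characters such as spaces
    # and requires the path to be absolute.
    return Path(path).as_uri() + "/"

print(seed_url("/mnt/directory"))
```

The trailing slash is added so that the resulting URL lines up with a filter rule such as +^file:///mnt/directory/.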

How do I index an NFS mount using Nutch?
