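I'm trying to build a search tool, hosted on a CentOS 7 machine, that should index and search directories from mounted NFS exports. From what I've found, Nutch + Solr is the best option. I'm having trouble configuring the URLs for this, because it won't be searching any http locations.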
The mount is located at /mnt.
So my seeds.txt looks like this:
[root@sauron bin]# cat /root/Desktop/apache-nutch-1.13/urls/seed.txt
file:///mnt
And my regex-urlfilter.txt has the same location, and it also allows the file protocol:
# skip file: ftp: and mailto: urls
-^(http|https|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^file:///mnt
However, when I try to bootstrap from the initial seed list, no updates are made:
[root@sauron apache-nutch-1.13]# bin/nutch inject crawl/crawldb urls
Injector: starting at 2017-06-12 00:07:49
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 1
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: finished at 2017-06-12 00:10:27, elapsed: 00:02:38
I've also tried changing seeds.txt to the following, with no luck:
MasterServer
If I'm doing something wrong here, please let me know.
Answer 0 (score: 0)
From a URI point of view, the file system is not that different for Nutch. You just need to enable the protocol-file plugin and configure regex-urlfilter.txt like this:
+^file:///mnt/directory/
-.
With these rules, you prevent Nutch from indexing the parent directories of the directory you specified.
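To see why those two rules keep the crawl inside one directory, here is a rough bash sketch of the matching logic. It assumes the filter tries rules top to bottom, lets the first pattern found in the URL decide, and rejects URLs that match no rule. The check_url function and the sample URLs are hypothetical and only illustrate the idea; this is not Nutch's own code.

```shell
#!/usr/bin/env bash
# Simplified stand-in for a regex URL filter (assumption: first match wins,
# '+' accepts, '-' rejects, and a URL matching no rule is rejected).
check_url() {
  local url=$1 rule sign pattern
  shift
  for rule in "$@"; do
    sign=${rule:0:1}      # leading '+' or '-'
    pattern=${rule:1}     # the regular expression after the sign
    if [[ $url =~ $pattern ]]; then
      if [[ $sign == "+" ]]; then echo accept; else echo reject; fi
      return
    fi
  done
  echo reject
}

# The two rules from the answer, in file order.
rules=('+^file:///mnt/directory/' '-.')

check_url 'file:///mnt/directory/report.txt' "${rules[@]}"   # prints: accept
check_url 'file:///mnt/' "${rules[@]}"                        # prints: reject
check_url 'file:///' "${rules[@]}"                            # prints: reject
```

Anything under /mnt/directory/ hits the first rule and is accepted. Every other URL, including the parent directories, falls through to the catch-all `-.` and is rejected.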
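Keep in mind that since you have already mounted the NFS share locally, it behaves like a normal local file system. More information can be found at https://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F.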
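For reference, enabling protocol-file usually means overriding plugin.includes in conf/nutch-site.xml, inside the <configuration> element. The fragment below is a sketch, not a verified configuration for this setup: the rest of the plugin list is assumed to mirror the stock defaults, so compare it with the plugin.includes value in conf/nutch-default.xml of your own installation before using it.

```xml
<!-- conf/nutch-site.xml (sketch): protocol-file takes the place of protocol-http.
     The remaining plugins are an assumption based on the default list;
     check conf/nutch-default.xml for the value shipped with your version. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```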