I am using nutch-1.15 with Elasticsearch. I want to crawl all the links found on a parent URL and index them, but I do not want to index the parent URL itself.
I only want URLs such as link1, link2, link3, and so on to be indexed, not the parent URL http://someLink.com/cgi-bin/parent.cgi
How can this be done?
Answer 0 (score: 0)
+^(?:https?:\/\/)?(?:www\.)?somelink\.[a-zA-Z0-9.\S]+\/cgi-bin\/.*
Put this rule in conf/regex-urlfilter.txt. It allows the following links:
<http://somelink.com/cgi-bin/link1>
<http://somelink.com/cgi-bin/link2>
<http://somelink.com/cgi-bin/link3>
<http://somelink.com/cgi-bin/>
It works if you place an exclusion rule for the parent URL in front of it.
Put the following in conf/regex-urlfilter.txt:
-^http:\/\/somelink.com\/cgi-bin\/parent.cgi
+^(?:https?:\/\/)?(?:www\.)?somelink\.[a-zA-Z0-9.\S]+\/cgi-bin\/.*
-^.
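The order of these rules matters because the regex URL filter uses the first rule that matches a URL to decide whether it is kept. Below is a minimal Python sketch of that evaluation, not Nutch code: the `accepts` helper and the test URLs are illustrative assumptions, and Python's `re` stands in for the Java regex engine that Nutch uses.

```python
import re

# Rules copied from the answer above, in file order: (sign, pattern).
RULES = [
    ("-", r"^http:\/\/somelink.com\/cgi-bin\/parent.cgi"),
    ("+", r"^(?:https?:\/\/)?(?:www\.)?somelink\.[a-zA-Z0-9.\S]+\/cgi-bin\/.*"),
    ("-", r"^."),
]


def accepts(url, rules=RULES):
    """Hypothetical helper: apply the rules top to bottom, first match decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is dropped.
    return False


for url in (
    "http://somelink.com/cgi-bin/parent.cgi",
    "http://somelink.com/cgi-bin/link1",
    "http://example.org/",
):
    print(url, "accepted" if accepts(url) else "rejected")
```

With the exclusion rule listed first, the parent URL is rejected before the broader `+` rule is consulted. Swapping the first two rules would let the parent URL through. The sketch only models which URLs pass the filter; it says nothing about what ends up in the index.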
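Answer 1 (score: 0)
The index-jexl-filter plugin excludes documents from the index while they are still crawled and parsed, and their outlinks are still followed.
You can easily test the expression: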
% bin/nutch indexchecker \
-Dplugin.includes='protocol-okhttp|parse-html|index-(basic|jexl-filter)' \
-Dindex.jexl.filter=' url != "http://localhost/" ' http://localhost/
fetching: http://localhost/
...
Document discarded by indexing filter
Other URLs are indexed, i.e. the fields that would be indexed are shown:
% bin/nutch indexchecker \
-Dplugin.includes='protocol-okhttp|parse-html|index-(basic|jexl-filter)' \
-Dindex.jexl.filter=' url != "http://localhost/" ' http://localhost/index.html
fetching: http://localhost/index.html
...
title : Apache2 Ubuntu Default Page: It works
url : http://localhost/index.html
...