Question

I am using nutch-1.15 with Elasticsearch. I want to crawl all the links present on a parent URL and index them, but I do not want to index the parent URL itself.

I only want URLs such as link 1, link 2, link 3 and so on to be indexed, not the parent URL http://someLink.com/cgi-bin/parent.cgi

How can this be done?

Answer 1

 +^(?:https?:\/\/)?(?:www\.)?somelink\.[a-zA-Z0-9.\S]+\/cgi-bin\/.*

Put the rule above into conf/regex-urlfilter.txt. It accepts the following links:

<http://somelink.com/cgi-bin/link1>
<http://somelink.com/cgi-bin/link2>
<http://somelink.com/cgi-bin/link3> 
<http://somelink.com/cgi-bin/>

It will work if you place an exclusion rule for the parent URL before it.

Put the following into conf/regex-urlfilter.txt:

-^http:\/\/somelink\.com\/cgi-bin\/parent\.cgi
+^(?:https?:\/\/)?(?:www\.)?somelink\.[a-zA-Z0-9.\S]+\/cgi-bin\/.*
-^.
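
The order of the rules matters because the regex URL filter stops at the first pattern that matches a URL. The following is a minimal Python sketch of that first-match-wins behaviour, not Nutch's own code: the rules are transcribed from this answer (with literal dots escaped), and the function name `accepts` and the test URLs are assumptions made for illustration.

```python
import re

# Rules transcribed from this answer, in file order: (sign, pattern).
# "+" accepts a URL, "-" rejects it.
RULES = [
    ("-", r"^http:\/\/somelink\.com\/cgi-bin\/parent\.cgi"),
    ("+", r"^(?:https?:\/\/)?(?:www\.)?somelink\.[a-zA-Z0-9.\S]+\/cgi-bin\/.*"),
    ("-", r"^."),
]


def accepts(url: str) -> bool:
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules.
    for url in (
        "http://somelink.com/cgi-bin/parent.cgi",
        "http://somelink.com/cgi-bin/link1",
        "http://other.com/page",
    ):
        print("+" if accepts(url) else "-", url)
```

If the first two rules were swapped, the parent URL would be accepted by the broader `+` rule before the exclusion was ever checked.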

Answer 2

The "index-jexl-filter" plugin excludes documents from the index while they are still crawled and parsed and their outlinks are still followed.

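Activate the plugin by adding it to the property "plugin.includes".
Define a Jexl expression in the property "index.jexl.filter" that evaluates to false for the parent page. Besides the URL itself, the HTTP status, the title and further variables are available in the Jexl context. If in doubt, have a look at the class JexlIndexingFilter.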
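
As a rough sketch, the two properties could be set in conf/nutch-site.xml along the following lines. The plugin list is copied from the test command in this answer and is only illustrative; a real setup would more likely append index-jexl-filter to its existing list. The expression assumes the parent URL from the question with a lowercase host, so adjust it to match the URL exactly as it appears in your crawl.

```xml
<!-- conf/nutch-site.xml (fragment): goes inside the <configuration> element -->
<property>
  <name>plugin.includes</name>
  <!-- Illustrative value taken from the test command in this answer;
       in practice, add index-jexl-filter to the plugins you already use. -->
  <value>protocol-okhttp|parse-html|index-(basic|jexl-filter)</value>
</property>
<property>
  <name>index.jexl.filter</name>
  <!-- Assumption: parent URL from the question, host in lowercase.
       Documents for which this evaluates to false are not indexed. -->
  <value>url != "http://somelink.com/cgi-bin/parent.cgi"</value>
</property>
```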

You can easily test the expression:

% bin/nutch indexchecker \
  -Dplugin.includes='protocol-okhttp|parse-html|index-(basic|jexl-filter)' \
  -Dindex.jexl.filter=' url != "http://localhost/" ' http://localhost/
fetching: http://localhost/
...
Document discarded by indexing filter

Other URLs are indexed, i.e. the fields that would be indexed are shown:

% bin/nutch indexchecker \
  -Dplugin.includes='protocol-okhttp|parse-html|index-(basic|jexl-filter)' \
  -Dindex.jexl.filter=' url != "http://localhost/" ' http://localhost/index.html
fetching: http://localhost/index.html
...
title : Apache2 Ubuntu Default Page: It works
url :   http://localhost/index.html
...

Prevent indexing of the parent URL
