Question

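I am following the documentation for the Nutch IndexReplace plugin published at https://wiki.apache.org/nutch/IndexReplace, and I am trying to set up regular expressions that create an additional field storing the type of content, derived from the URL.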

This is the property I added to my conf/nutch-site.xml file:

<property>
  <name>index.replace.regexp</name>
  <value>
    url:content_type=/.*wiki.example.com.*/wiki/
    url:content_type=/.*www.example.com.*/website/
  </value>
</property>

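The goal is to create an additional field, content_type, and populate it with either wiki or website, depending on which URL the page was fetched from. Both the url and content_type fields are populated in my Solr instance, but both contain the full URL, e.g.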

sample fetched url: https://wiki.example.com/home
value of Solr field url: https://wiki.example.com/home
value of Solr field content_type: https://wiki.example.com/home

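So it looks like the regular expressions are not being evaluated as expected in Nutch, even though they evaluate as expected in the online regex tester at http://www.ocpsoft.org/tutorials/regular-expressions/java-visual-regex-tester/.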

Could you clarify what the correct regular expression syntax is, so that for the sample input URL above the fields evaluate as follows?

url: http://wiki.example.com/home
content_type: wiki

Answer 1

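The regular expressions work correctly; the problem is that the second rule overrides the effect of the first. For a wiki URL the second pattern does not match, so the unchanged URL is written to content_type, replacing the wiki value set by the first rule. The following gives the desired effect (note that each replacement is applied only when its urlmatch evaluates to true):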

<property>
  <name>index.replace.regexp</name>
  <value>
    urlmatch=.*wiki.example.com.*
    url:content_type=/.*wiki.example.com.*/wiki/
    urlmatch=.*www.example.com.*
    url:content_type=/.*www.example.com.*/website/
  </value>
</property>
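
To see why the unguarded rules leave the full URL in content_type, the behaviour can be approximated in plain Java. This is only a rough sketch and not the plugin's actual code: the class IndexReplaceSketch, the Rule type, and contentTypeFor are hypothetical names, and it assumes that each rule reads url, runs a replaceAll-style substitution, and writes the result to content_type, and that urlmatch acts as a guard on the rules that follow it. Whether the plugin tests the guard with find() or matches() is not confirmed here; the patterns in this example behave the same under either.

```java
import java.util.regex.Pattern;

public class IndexReplaceSketch {

    /** One hypothetical rule: optional URL guard (null means "always apply"), pattern, replacement. */
    static final class Rule {
        final String urlMatch;
        final String pattern;
        final String replacement;

        Rule(String urlMatch, String pattern, String replacement) {
            this.urlMatch = urlMatch;
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    /** The two rules from the question, with no urlmatch guard. */
    static final Rule[] UNGUARDED = {
        new Rule(null, ".*wiki.example.com.*", "wiki"),
        new Rule(null, ".*www.example.com.*", "website"),
    };

    /** The same rules, each gated by the urlmatch from the answer. */
    static final Rule[] GUARDED = {
        new Rule(".*wiki.example.com.*", ".*wiki.example.com.*", "wiki"),
        new Rule(".*www.example.com.*", ".*www.example.com.*", "website"),
    };

    /**
     * Applies the rules in order. Every rule whose guard passes overwrites
     * content_type with url.replaceAll(pattern, replacement); when the pattern
     * does not match, replaceAll returns the URL unchanged.
     */
    static String contentTypeFor(String url, Rule[] rules) {
        String contentType = null;
        for (Rule rule : rules) {
            boolean guardPasses = rule.urlMatch == null
                    || Pattern.compile(rule.urlMatch).matcher(url).find();
            if (!guardPasses) {
                continue; // assumed behaviour: skip rules whose urlmatch fails
            }
            contentType = url.replaceAll(rule.pattern, rule.replacement);
        }
        return contentType;
    }

    public static void main(String[] args) {
        String wikiUrl = "https://wiki.example.com/home";
        // Second rule does not match, so it writes the unchanged URL over "wiki".
        System.out.println(contentTypeFor(wikiUrl, UNGUARDED)); // https://wiki.example.com/home
        // With guards, the second rule is skipped for wiki URLs.
        System.out.println(contentTypeFor(wikiUrl, GUARDED));   // wiki
    }
}
```

Under these assumptions the unguarded run reproduces the symptom from the question, while the guarded run yields wiki for the wiki URL and website for a www.example.com URL.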

What is the syntax for specifying regular expressions for Nutch's index.replace.regexp plugin?