Question

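I am using Nutch to crawl some websites, and a custom plugin (myplugin) to index the data into Elasticsearch.

I need the information stored in the meta tags of the crawled sites. To get it, I added the following properties to nutch-site.xml: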

<property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|myplugin|urlfilter-regex|parse-(tika|html|js|css|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
</property>

<property>
    <name>metatags.names</name>
    <value>*</value>
</property>

<property>
    <name>index.parse.md</name>
    <value>keywords,description</value>
</property>

<property>
    <name>index.content.md</name>
    <value>keywords,description</value>
</property>

It works for some sites, but not for sites such as this one.

Any help would be appreciated.

Answer 1

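Based on the answer and hints from Julien Nioche, you can change your parse-filter plugin to something like this, so that every meta name it parses is lowercased, which is the issue here.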

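        // Copy the parse metadata into a new Metadata object with lowercased names.
        Metadata newMeta = new Metadata();
        Metadata oldMeta = parse.getData().getParseMeta();
        String metaValue;
        for (String metaName : oldMeta.names()) {
          metaValue = oldMeta.get(metaName);
          newMeta.add(metaName.toLowerCase(), metaValue);
        }

        // Rebuild the ParseData with the normalized metadata; status, title, text,
        // content and parseResult come from the surrounding filter method.
        parseData = new ParseData(status, title, parse.getData().getOutlinks(),
                                      parse.getData().getContentMeta(), newMeta);
        parseResult.put(content.getUrl(), new ParseText(text), parseData);
        return parseResult;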
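
The same idea can be tried outside Nutch. The following is only a sketch: a plain `Map<String, List<String>>` stands in for Nutch's `Metadata`, and the class name `LowercaseMetaKeys` and the method `lowercaseKeys` are hypothetical names made up for this illustration. It also merges the values of names that differ only by case, which the snippet above does not attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone illustration; a Map stands in for Nutch's Metadata.
public class LowercaseMetaKeys {

    // Returns a copy of `meta` whose keys are lowercased. Values of keys that
    // differ only by case (e.g. "Keywords" and "keywords") are merged in order.
    static Map<String, List<String>> lowercaseKeys(Map<String, List<String>> meta) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : meta.entrySet()) {
            String key = e.getKey().toLowerCase(Locale.ROOT);
            result.computeIfAbsent(key, k -> new ArrayList<>()).addAll(e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> parsed = new LinkedHashMap<>();
        parsed.put("Description", Arrays.asList("An example page"));
        parsed.put("Keywords", Arrays.asList("nutch", "crawler"));
        parsed.put("keywords", Arrays.asList("search"));

        Map<String, List<String>> normalized = lowercaseKeys(parsed);
        System.out.println(normalized.get("description")); // [An example page]
        System.out.println(normalized.get("keywords"));    // [nutch, crawler, search]
    }
}
```

`Locale.ROOT` is used so that the result does not depend on the default locale of the machine running the crawl.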

HTH

Answer 2

It is probably because the names are capitalized:

<meta name="Description" content="...">
<meta name="Keywords" content="...">

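Maybe try case variants in the configuration.

You can use './nutch indexchecker ......' to test fetching and field generation for a given URL.

EDIT: https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L92 lowercases the key we are looking for, but the key names in the parse metadata may still be in their original case, i.e. capitalized.

Until this is fixed, you can add some custom code to your own plugin to lowercase the keys, or modify MetadataIndexer so that it either preserves the case or can handle variations in case.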
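
As a rough illustration of the last option, handling variations in case at lookup time, here is a small sketch. It is not the MetadataIndexer code: a plain `Map<String, String>` stands in for the parse metadata, and `CaseInsensitiveLookup` and `getIgnoreCase` are hypothetical names used only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone illustration; a Map stands in for the parse metadata.
public class CaseInsensitiveLookup {

    // Returns the value of the first key that equals `wanted` ignoring case,
    // or null when no such key exists.
    static String getIgnoreCase(Map<String, String> meta, String wanted) {
        for (Map.Entry<String, String> e : meta.entrySet()) {
            if (e.getKey().equalsIgnoreCase(wanted)) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("Description", "An example page"); // name as written in the HTML

        // An exact lookup with the lowercased key misses the capitalized name.
        System.out.println(parsed.get("description"));            // null
        // A case-insensitive lookup finds it.
        System.out.println(getIgnoreCase(parsed, "description")); // An example page
    }
}
```

The linear scan is fine for the handful of meta tags on a page. Normalizing the keys once at parse time, as Answer 1 does, avoids repeating the scan for every configured field.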

Meta tags are not indexed for some websites
