Question

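I am using Apache Nutch 2.3 (the latest version). I crawled about 49,000 documents with Nutch. According to a MIME-type analysis of the crawled data, about 45,000 of them are text/html documents. But when I look at the indexed documents in Solr (4.10.3), only about 14,000 documents are indexed. Why is there such a large gap (45,000 - 14,000 = 31,000)? Even if I assume that Nutch indexes only text/html documents, at least 45,000 documents should have been indexed.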

What is the problem, and how can I fix it?

Answer 1

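In my case, the problem was caused by missing Solr indexer information in nutch-site.xml. Once I updated the configuration, the problem went away. Check your crawler log for the indexing step. In my case, it reported that no Solr indexer plugin was found.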

I added the following lines (a property) to nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>plugin details here</description>
</property>

Answer 2

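You should look at the Solr logs to see whether they mention "duplicate" documents, or simply look in the solrconfig.xml file of the core you are pushing documents to. There may be a "dedupe" call on the update handler, and the fields it uses may be causing documents to be dropped as duplicates (based on a few fields). You will see something like this: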

<requestHandler name="/dataimport"
        class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <!-- change "dedupe" to "uuid", or comment out the next line -->
        <str name="update.chain">dedupe</str>
        <str name="config">dih-config.xml</str>
    </lst>
</requestHandler>

and, later in the file, the definition of the dedupe update.chain:

<updateRequestProcessorChain name="dedupe">
     <processor class="solr.processor.SignatureUpdateProcessorFactory">
         <bool name="enabled">true</bool>
         <str name="signatureField">id</str>
         <bool name="overwriteDupes">true</bool>
         <!-- the fields listed below decide which records count as duplicates -->
         <str name="fields">url,date,rawline</str>
         <str name="signatureClass">solr.processor.Lookup3Signature</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

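The "fields" element selects which input data is used to determine whether a record is unique. Of course, if you know your input data contains no duplicates, this is not the problem. But the configuration above will discard any record whose values match an existing record on all of the listed fields.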

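You may not be using the dataimport requestHandler but the "update" requestHandler instead. I am not sure which one Nutch uses. Either way, you can simply comment out the update.chain, change it to a different processorChain such as "uuid", or add more fields to the "fields" declaration.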
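As a rough sketch of the first option (commenting out the update.chain), this is how it might look on an "update" requestHandler. It assumes your core's solrconfig.xml declares the handler as `/update` with the class `solr.UpdateRequestHandler` and that the chain is wired in through a `defaults` list; your actual file may differ, so treat the names here as placeholders to match against your own config.

```xml
<!-- Sketch only: handler name and class are assumptions, not taken from your setup. -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
    <!-- Disabled so that incoming documents bypass the dedupe chain.
    <lst name="defaults">
        <str name="update.chain">dedupe</str>
    </lst>
    -->
</requestHandler>
```

After a change like this, reload the core (or restart Solr) and re-run the indexing step, then compare the document count again.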

Apache Nutch does not index all documents into Apache Solr
