Question

I am currently trying to extract the link structure of a website during a crawl run, using Apache Nutch 1.13 and Solr 4.10.4.

According to the documentation, the index-links plugin adds outlinks and inlinks to the collection.

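I changed my collection in Solr accordingly (added the respective fields to schema.xml and restarted Solr) and adapted the solr-mapping file, but to no avail. The resulting error is shown below.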

bin/nutch index -D solr.server.url=http://localhost:8983/solr/collection1 crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter -normalize -deleteGone
Segment dir is complete: crawl/segments/20170503114357.
Indexer: starting at 2017-05-03 11:47:02
Indexer: deleting gone documents: true
Indexer: URL filtering: true
Indexer: URL normalizing: true
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexing 1/1 documents
Deleting 0 documents
Indexing 1/1 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

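Interestingly, my own research leads me to assume that this is actually non-trivial, because the parse result without the plugin looks like this: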

bin/nutch indexchecker http://www.my-domain.com/
fetching: http://www.my-domain.com/
robots.txt whitelist not configured.
parsing: http://www.my-domain.com/
contentType: application/xhtml+xml
tstamp :    Wed May 03 11:40:57 CEST 2017
digest :    e549a51553a0fb3385926c76c52e0d79
host :  http://www.my-domain.com/
id :    http://www.my-domain.com/
title : Startseite
url :   http://www.my-domain.com/
content :   bla bla bla bla.

However, as soon as I enable index-links, the output suddenly looks like this:

bin/nutch indexchecker http://www.my-domain.com/
fetching: http://www.my-domain.com/
robots.txt whitelist not configured.
parsing: http://www.my-domain.com/
contentType: application/xhtml+xml
tstamp :    Wed May 03 11:40:57 CEST 2017
outlinks :  http://www.my-domain.com/2-uncategorised/331-links-administratives
outlinks :  http://www.my-domain.com/2-uncategorised/332-links-extern
outlinks :  http://www.my-domain.com/impressum.html
id :    http://www.my-domain.com/
title : Startseite
url :   http://www.my-domain.com/
content :   bla bla bla

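Obviously, this does not fit into a single-valued field, but all I want is one list containing all outlinks (I have read that inlinks do not work, but I do not need them anyway).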

Answer 1

You have to specify the fields in solrindex-mapping.xml:

<field dest="inlinks" source="inlinks"/>
<field dest="outlinks" source="outlinks"/>

Then make sure to unload and reload the collection, including a full restart of Solr.
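As a rough sketch of the reload step only: Solr's CoreAdmin API accepts a RELOAD action over HTTP. The snippet below just builds that URL; the base URL and the core name `collection1` are taken from the question's index command, and the helper name `core_reload_url` is made up for illustration. It does not replace the full Solr restart recommended above.

```python
from urllib.parse import urlencode


def core_reload_url(solr_base: str, core: str) -> str:
    """Build the CoreAdmin RELOAD URL for a single core.

    solr_base is the Solr root (assumed here: http://localhost:8983/solr),
    core is the core/collection name (assumed here: collection1).
    """
    query = urlencode({"action": "RELOAD", "core": core})
    return f"{solr_base.rstrip('/')}/admin/cores?{query}"


# Print the URL; open it with a browser, curl, or urllib.request.urlopen
# once Solr is running again.
print(core_reload_url("http://localhost:8983/solr", "collection1"))
```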

You did not say how exactly you defined the fields in schema.xml, but the following works for me:

<!-- fields for index-links plugin -->
<field name="inlinks" type="url" stored="true" indexed="false" multiValued="true"/>
<field name="outlinks" type="url" stored="true" indexed="false" multiValued="true"/>
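Because both fields are declared with stored="true" and multiValued="true", a successfully indexed document should return its outlinks as a list. A minimal way to check this after re-indexing, assuming the core URL from the question and Solr's standard JSON response layout (the helper names are hypothetical, and the sample response in the comment is hand-written, not real output):

```python
import json
from urllib.parse import urlencode


def outlinks_query_url(core_url: str, doc_id: str) -> str:
    """Build a /select URL that returns only id and outlinks for one document."""
    params = {"q": f'id:"{doc_id}"', "fl": "id,outlinks", "wt": "json"}
    return f"{core_url.rstrip('/')}/select?{urlencode(params)}"


def extract_outlinks(response_text: str) -> list:
    """Collect the outlinks values from a Solr JSON response body."""
    docs = json.loads(response_text)["response"]["docs"]
    # A multiValued field arrives as a JSON array; it is absent if nothing was stored.
    return [link for doc in docs for link in doc.get("outlinks", [])]


# Hand-written sample shaped like a Solr JSON response (illustrative only):
# {"response": {"docs": [{"id": "http://www.my-domain.com/",
#                         "outlinks": ["http://www.my-domain.com/impressum.html"]}]}}
print(outlinks_query_url("http://localhost:8983/solr/collection1",
                         "http://www.my-domain.com/"))
```

Fetch the printed URL and pass the response body to `extract_outlinks`; an empty list suggests the mapping or schema change has not taken effect yet.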

Good luck!
