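I have configured apache-nutch-1.12 with solr-4.6.
The problem is that in Solr every field is populated except "content". I want the "content" field to hold all the text of the HTML page.
In apache-nutch-1.12/conf/solrindex-mapping.xml, I added two fields: "description" and "keywords".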
<mapping>
<fields>
<field dest="description" source="metatag.description"/>
<field dest="keywords" source="metatag.keywords"/>
<field dest="content" source="content"/>
<field dest="title" source="title"/>
<field dest="host" source="host"/>
<field dest="segment" source="segment"/>
<field dest="boost" source="boost"/>
<field dest="digest" source="digest"/>
<field dest="tstamp" source="tstamp"/>
</fields>
<uniqueKey>id</uniqueKey>
</mapping>
In the Solr schema.xml, I added two fields: "description" and "keywords".
<fields>
<field name="description" type="text_general" stored="true" indexed="true" multiValued="true"/>
<field name="keywords" type="text_general" stored="true" indexed="true" multiValued="true"/>
...
<field name="content" type="text_general" stored="true" indexed="true"/>
...
</fields>
In nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>metatags.names</name>
<value>description,keywords</value>
</property>
<property>
<name>index.parse.md</name>
<value>metatag.description,metatag.keywords</value>
</property>
<property>
<name>index.metadata</name>
<value>description,keywords</value>
</property>
I run:
bin/crawl -i -D solr.server.url=http://localhost:8080/solr/argentina urls/argentina.txt crawler/argentina 1
After that, the "content" field in Solr is empty.
Example Solr response:
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"indent":"true",
"q":"*:*",
"wt":"json",
"rows":"1000"}},
"response":{"numFound":28,"start":0,"docs":[
{
"tstamp":"2016-11-24T11:08:13.743Z",
"description":["My perfect description"],
"segment":"20161124120734",
"digest":"a449c62842cb0c461ec831a683642af3",
"boost":1.0,
"id":"http://www.my-beauty-site.org/argentina",
"keywords":["myKey1, myKey2, myKey3"],
"url":"http://www.my-beauty-site.org/argentina",
"content":"",
"_version_":1551877664131776512
},
... ... ....
}
}
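To check whether every indexed document is affected and not just the one shown, a small script along these lines could be run over the JSON response. This is only a sketch: the function name `ids_with_empty_content` is made up for illustration, the sample is a trimmed copy of the response above, and the second document in it is hypothetical.

```python
import json


def is_blank(value):
    """True if a Solr field value is missing, empty, or whitespace only."""
    if isinstance(value, list):  # multiValued fields come back as lists
        return all(is_blank(v) for v in value)
    return value is None or not str(value).strip()


def ids_with_empty_content(response_json):
    """Return the ids of all docs whose "content" field is blank."""
    docs = json.loads(response_json)["response"]["docs"]
    return [doc["id"] for doc in docs if is_blank(doc.get("content"))]


# Trimmed sample; the second document is hypothetical, added for contrast.
sample = """
{"response": {"numFound": 2, "start": 0, "docs": [
  {"id": "http://www.my-beauty-site.org/argentina", "content": ""},
  {"id": "http://www.my-beauty-site.org/other", "content": "Some page text"}
]}}
"""

print(ids_with_empty_content(sample))
```

If the returned list contains all 28 ids, the problem is not specific to one page.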
Then I ran the following command to read the data from the segments:
bin/nutch readseg -dump crawler/argentina/segments/* argentinaDirectory -nocontent -nofetch -nogenerate -noparse -noparsedata
In the resulting argentinaDirectory/dump file, the "ParseText" field is always empty.
Recno:: 0
URL:: http://www.my-beauty-site.org/argentina
ParseText::
Recno:: 1
URL:: ... ...
...
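To confirm that "always empty" holds for the whole dump and not only for the first few records, the file can be scanned with a short script. The following is a minimal sketch that assumes the dump layout shown above (a "Recno::" line starts each record, followed by "URL::" and "ParseText::"). The function name `empty_parse_text_urls` and the second sample record are hypothetical.

```python
import re


def empty_parse_text_urls(dump):
    """Return the URLs of records whose ParseText section has no text."""
    urls = []
    # Each record starts with a "Recno:: N" line; drop whatever precedes the first.
    records = re.split(r"^Recno:: \d+[ \t]*$", dump, flags=re.MULTILINE)[1:]
    for record in records:
        url_match = re.search(r"^URL:: (\S+)", record, flags=re.MULTILINE)
        _, marker, text = record.partition("ParseText::")
        # Only count records that actually have a ParseText section.
        if url_match and marker and not text.strip():
            urls.append(url_match.group(1))
    return urls


# Sample in the layout of the dump above; the second record is hypothetical.
sample = """Recno:: 0
URL:: http://www.my-beauty-site.org/argentina

ParseText::

Recno:: 1
URL:: http://www.my-beauty-site.org/other

ParseText::
Some page text
"""

print(empty_parse_text_urls(sample))
```

For the real file, the same function could be applied to the contents of argentinaDirectory/dump, for example `empty_parse_text_urls(open("argentinaDirectory/dump", encoding="utf-8").read())`.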