Getting text data from HTML pages with Nutch and storing it in Solr

Time: 2016-11-24 16:26:02

Tags: apache solr web-crawler nutch

I have configured apache-nutch-1.12 with solr-4.6.

The problem is that every field in Solr is populated except "content". I want the "content" field to hold all the text of the HTML page.

In apache-nutch-1.12/conf/solrindex-mapping.xml I added two fields, "description" and "keywords":

<mapping>
  <fields>
    <field dest="description" source="metatag.description"/>
    <field dest="keywords" source="metatag.keywords"/>
    <field dest="content" source="content"/>
    <field dest="title" source="title"/>
    <field dest="host" source="host"/>
    <field dest="segment" source="segment"/>
    <field dest="boost" source="boost"/>
    <field dest="digest" source="digest"/>
    <field dest="tstamp" source="tstamp"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>

In Solr's schema.xml I added the same two fields, "description" and "keywords":

<fields>
 <field name="description" type="text_general" stored="true" indexed="true" multiValued="true"/>
 <field name="keywords" type="text_general" stored="true" indexed="true" multiValued="true"/>
 ...

 <field name="content" type="text_general" stored="true" indexed="true"/>
 ...
</fields>

In nutch-site.xml:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
 <name>metatags.names</name>
 <value>description,keywords</value>
</property>
<property>
 <name>index.parse.md</name>
 <value>metatag.description,metatag.keywords</value>
</property>
<property>
 <name>index.metadata</name>
 <value>description,keywords</value>
</property>
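
Since an empty ParseText points at the parsing step, one quick sanity check is whether the parser plugins are actually selected by plugin.includes. The following is a minimal sketch, assuming Nutch treats the value as a regular expression that must match a plugin id in full; the helper name `is_included` and the list of ids being checked are hypothetical and only for illustration.

```python
import re

# Value of plugin.includes copied from the nutch-site.xml above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika|metatags)"
    "|index-(basic|anchor|metadata)|indexer-solr|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes regex.

    Assumption: the pattern is applied as a full match against the id.
    """
    return re.fullmatch(includes, plugin_id) is not None


# Hypothetical list of plugin ids to check against the pattern.
for plugin_id in ["parse-html", "parse-tika", "parse-metatags",
                  "index-basic", "indexer-solr", "protocol-httpclient"]:
    print(plugin_id, is_included(plugin_id))
```

Under that assumption the pattern does select parse-html, parse-tika and parse-metatags, so the plugin list itself does not look like the reason for the empty text.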

Then I run:

bin/crawl -i -D solr.server.url=http://localhost:8080/solr/argentina urls/argentina.txt crawler/argentina  1

After that, the "content" field in Solr is empty.

Example Solr response:

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "indent":"true",
      "q":"*:*",
      "wt":"json",
      "rows":"1000"}},
  "response":{"numFound":28,"start":0,"docs":[
    {
        "tstamp":"2016-11-24T11:08:13.743Z",
        "description":["My perfect description"],
        "segment":"20161124120734",
        "digest":"a449c62842cb0c461ec831a683642af3",
        "boost":1.0,
        "id":"http://www.my-beauty-site.org/argentina",
        "keywords":["myKey1, myKey2, myKey3"],
        "url":"http://www.my-beauty-site.org/argentina",
        "content":"",
        "_version_":1551877664131776512
    },
    ... ... ....
    }
}
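
To see whether every document is affected or only some, the response can be scanned for blank "content" values. This is a minimal sketch that works on an already-parsed response (for example the result of `json.loads` on the body returned by the select handler); the function name `empty_content_ids` is hypothetical, and the sample holds only the one document shown above, trimmed to the relevant fields.

```python
def empty_content_ids(solr_response):
    """Return the ids of documents whose "content" field is missing or blank."""
    docs = solr_response.get("response", {}).get("docs", [])
    empty = []
    for doc in docs:
        content = doc.get("content", "")
        # A multiValued field comes back as a list; join it before testing.
        if isinstance(content, list):
            content = " ".join(content)
        if not content.strip():
            empty.append(doc.get("id"))
    return empty


# The single document from the response above, trimmed to the relevant fields.
sample = {
    "response": {
        "numFound": 28,
        "start": 0,
        "docs": [
            {
                "id": "http://www.my-beauty-site.org/argentina",
                "description": ["My perfect description"],
                "content": "",
            }
        ],
    }
}

print(empty_content_ids(sample))
```

If the returned list covers all 28 documents, the problem is unlikely to be specific to one page.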

Then I ran the following command to read the data in the segments:

bin/nutch readseg -dump crawler/argentina/segments/* argentinaDirectory -nocontent -nofetch -nogenerate -noparse -noparsedata

In the resulting argentinaDirectory/dump file, the "ParseText" field is always empty:

Recno:: 0
URL:: http://www.my-beauty-site.org/argentina
ParseText::

Recno:: 1
URL:: ... ... 
...
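
Checking the dump by eye gets tedious with many records, so the same check can be scripted. The following is a minimal sketch, assuming the dump layout matches the excerpt above (each record starts with a "Recno::" line, followed by a "URL::" line and a "ParseText::" section); the function name `empty_parse_text_urls` and the second sample record are hypothetical.

```python
import re


def empty_parse_text_urls(dump_text):
    """Return the URLs of records whose ParseText section holds no text."""
    urls = []
    # Assumption: every record begins with a line of the form "Recno:: <n>".
    for record in re.split(r"(?m)^Recno::.*$", dump_text):
        url_match = re.search(r"(?m)^URL::[ \t]*(\S+)", record)
        if url_match is None:
            continue
        parts = record.split("ParseText::", 1)
        # Only whitespace after the marker means nothing was extracted.
        if len(parts) == 2 and not parts[1].strip():
            urls.append(url_match.group(1))
    return urls


# First record as in the dump above; the second one is made up for contrast.
sample_dump = (
    "Recno:: 0\n"
    "URL:: http://www.my-beauty-site.org/argentina\n"
    "ParseText::\n"
    "\n"
    "Recno:: 1\n"
    "URL:: http://example.org/with-text\n"
    "ParseText::\n"
    "Some extracted page text\n"
)

print(empty_parse_text_urls(sample_dump))
```

For the real file, the text can be loaded with `open("argentinaDirectory/dump", encoding="utf-8").read()` and passed to the same function.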

0 answers:

No answers