Nutch indexchecker shows parsed meta tags twice

Time: 2018-10-29 05:58:58

Tags: elasticsearch nutch

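I am using Nutch 1.15 with Elasticsearch 5.3.3. I want to parse meta tags and index them in Elasticsearch. This works, but when I run indexchecker each meta tag appears twice in the output.

Below is my nutch-site.xml: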

<configuration>
<property>
    <name>http.agent.name</name>
    <value>Nutch Spider</value>
</property>

<property>
    <name>plugin.includes</name>
    <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(html|text|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
</property>

<property>
    <name>metatags.names</name>
    <value>Keywords,Owner</value>
</property>

<property>
    <name>index.parse.md</name>
    <value>metatag.Keywords,metatag.owner</value>
</property>

<property>
    <name>index.content.md</name>
    <value>Keywords,owner</value>
</property>
<property>
    <name>http.auth.file</name>
    <value>httpclient-auth.xml</value>
    <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
</property>

<property>
    <name>db.ignore.external.links</name>
    <value>true</value>
</property>

<property>
    <name>elastic.host</name>
    <value>localhost</value>
</property>

<property>
    <name>elastic.port</name>
    <value>9300</value>
</property>
<property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
</property>

<property>
    <name>elastic.index</name>
    <value>nutch</value>
</property>

<property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
</property>

<property>
    <name>http.content.limit</name>
    <!--value>6553600</value-->
    <value>-1</value>
</property>

<property>
    <name>elastic.max.bulk.docs</name>
    <value>250</value>
    <description>Maximum size of the bulk in number of documents.</description>
</property>

<property>
    <name>elastic.max.bulk.size</name>
    <value>2500500</value>
    <description>Maximum size of the bulk in bytes.</description>
</property>
</configuration>
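The configuration above spells the meta tag names with different letter case in different properties (`Keywords,Owner` in `metatags.names`, `metatag.owner` in `index.parse.md`). The following is a minimal sketch of a consistency check for that, not a confirmed fix: whether the case difference has anything to do with the duplicated fields is an assumption this question does not establish. The helper names `load_properties` and `case_mismatches` are hypothetical, and `SAMPLE` is a trimmed copy of the two relevant properties.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def split_names(value):
    """Split a comma-separated property value into trimmed, non-empty names."""
    return [part.strip() for part in value.split(",") if part.strip()]


def case_mismatches(props):
    """List (extracted_name, indexed_name) pairs that differ only by letter case."""
    extracted = split_names(props.get("metatags.names", ""))
    indexed = split_names(props.get("index.parse.md", ""))
    by_lower = {name.lower(): name for name in extracted}
    mismatches = []
    for field in indexed:
        # index.parse.md entries carry the "metatag." prefix; strip it to compare.
        bare = field[len("metatag."):] if field.startswith("metatag.") else field
        original = by_lower.get(bare.lower())
        if original is not None and original != bare:
            mismatches.append((original, bare))
    return mismatches


# Trimmed copy of the two relevant properties from the nutch-site.xml above.
SAMPLE = """<configuration>
<property><name>metatags.names</name><value>Keywords,Owner</value></property>
<property><name>index.parse.md</name><value>metatag.Keywords,metatag.owner</value></property>
</configuration>"""

if __name__ == "__main__":
    print(case_mismatches(load_properties(SAMPLE)))
```

To check the real file, read conf/nutch-site.xml into a string and pass it to `load_properties` instead of `SAMPLE`.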

Output of indexchecker:

$ bin/nutch indexchecker http://nutch.apache.org/
fetching: http://nutch.apache.org/
robots.txt whitelist not configured.
parsing: http://nutch.apache.org/
contentType: text/html
tstamp :    Mon Oct 29 11:17:49 IST 2018
metatag.owner : dev@nutch.apache.org
metatag.owner : dev@nutch.apache.org
digest :    da0ffbf19768ea2cab9ffa0fb4a778a7
host :  nutch.apache.org
metatag.Keywords :  Apache Nutch Web Crawler
metatag.Keywords :  Apache Nutch Web Crawler
id :    http://nutch.apache.org/
title : Apache Nutch\u2122 -
url :   http://nutch.apache.org/
content :   Apache Nutch\u2122 -
Downloads
Community
Board Reporting
Robots Information
Contribute
Mailing Lists
Peop

Here, metatag.owner and metatag.Keywords each appear twice. Is there a way to fix this?
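To see exactly which fields repeat, the indexchecker output can be scanned for field names that occur more than once. This is a minimal sketch that assumes every field is printed as `name : value` on one line, as in the output shown here. The function name `repeated_fields` is hypothetical, and `SAMPLE` is an excerpt of that output.

```python
from collections import Counter


def repeated_fields(output):
    """Return {field_name: count} for field names printed more than once."""
    counts = Counter()
    for line in output.splitlines():
        # Field lines look like "name : value"; progress lines such as
        # "fetching: ..." have no space before the colon and are skipped.
        name, sep, _value = line.partition(" :")
        if sep and name.strip():
            counts[name.strip()] += 1
    return {name: n for name, n in counts.items() if n > 1}


# Excerpt of the indexchecker output from this question.
SAMPLE = """\
tstamp :    Mon Oct 29 11:17:49 IST 2018
metatag.owner : dev@nutch.apache.org
metatag.owner : dev@nutch.apache.org
digest :    da0ffbf19768ea2cab9ffa0fb4a778a7
host :  nutch.apache.org
metatag.Keywords :  Apache Nutch Web Crawler
metatag.Keywords :  Apache Nutch Web Crawler
id :    http://nutch.apache.org/
"""

if __name__ == "__main__":
    for field, count in repeated_fields(SAMPLE).items():
        print(f"{field}: printed {count} times")
```

The script only reports what the console shows. It does not tell whether the document sent to Elasticsearch contains the value twice or whether the field is merely printed twice.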

0 answers:

No answers