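I am using Nutch 1.15 with Elasticsearch 5.3.3. I want to parse meta tags and index them in Elasticsearch. This works, but when I run indexchecker I see each meta tag duplicated.
Below is my nutch-site.xml: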
<configuration>
<property>
<name>http.agent.name</name>
<value>Nutch Spider</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(html|text|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
</property>
<property>
<name>metatags.names</name>
<value>Keywords,Owner</value>
</property>
<property>
<name>index.parse.md</name>
<value>metatag.Keywords,metatag.owner</value>
</property>
<property>
<name>index.content.md</name>
<value>Keywords,owner</value>
</property>
<property>
<name>http.auth.file</name>
<value>httpclient-auth.xml</value>
<description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
<property>
<name>elastic.host</name>
<value>localhost</value>
</property>
<property>
<name>elastic.port</name>
<value>9300</value>
</property>
<property>
<name>elastic.cluster</name>
<value>elasticsearch</value>
</property>
<property>
<name>elastic.index</name>
<value>nutch</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
</property>
<property>
<name>http.content.limit</name>
<!--value>6553600</value-->
<value>-1</value>
</property>
<property>
<name>elastic.max.bulk.docs</name>
<value>250</value>
<description>Maximum size of the bulk in number of documents.</description>
</property>
<property>
<name>elastic.max.bulk.size</name>
<value>2500500</value>
<description>Maximum size of the bulk in bytes.</description>
</property>
</configuration>
Output of indexchecker:
$ bin/nutch indexchecker http://nutch.apache.org/
fetching: http://nutch.apache.org/
robots.txt whitelist not configured.
parsing: http://nutch.apache.org/
contentType: text/html
tstamp : Mon Oct 29 11:17:49 IST 2018
metatag.owner : dev@nutch.apache.org
metatag.owner : dev@nutch.apache.org
digest : da0ffbf19768ea2cab9ffa0fb4a778a7
host : nutch.apache.org
metatag.Keywords : Apache Nutch Web Crawler
metatag.Keywords : Apache Nutch Web Crawler
id : http://nutch.apache.org/
title : Apache Nutch\u2122 -
url : http://nutch.apache.org/
content : Apache Nutch\u2122 -
Downloads
Community
Board Reporting
Robots Information
Contribute
Mailing Lists
Peop
Here, metatag.owner and metatag.Keywords each appear twice. Is there a way to fix this?