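Problem activating the Nutch headings plugin

Time: 2014-07-10 20:51:15

Tags: plugins nutch

I am trying to activate the headings plugin in Nutch 1.8, but for some reason it does not work. Here is the relevant part of my nutch-site.xml: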

<property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|headings)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>activates metatag parsing </description>
</property>

<property>
  <name>headings</name>
  <value>h1;h2</value>
  <description>Comma separated list of headings to retrieve from the document</description>
</property>

<property>
  <name>headings.multivalued</name>
  <value>false</value>
  <description>Whether to support multivalued headings.</description>
</property>

<property>
 <name>index.parse.md</name>
 <value>metatag.description,metatag.title, metatag.keywords, metatag.author, 
metatag.author, headings.h1, headings.h2</value>
<description> Comma-separated list of keys to be taken from the parse metadata to generate fields. Can be used e.g. for 'description' or 'keywords' provided that these values are generated by a parser (see parse-metatags plugin)
</description>
</property>

Can anyone help?

Thanks, Chris

2 answers:

Answer 0 (score: 0)

<name>index.parse.md</name>

Check for metatag.h1 and metatag.h2:

<property>
  <name>index.parse.md</name>
  <value>metatag.h1,metatag.h2</value>
  ...

By the way, headings is not a parse-... filter, so it does not belong inside the parse-(...) group. You have to use

 <name>plugin.includes</name>
 <value>headings|parse-(html|tika|metatags)|...

Now it should work...
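
To illustrate why the placement inside parse-(...) matters, here is a minimal Python sketch. It assumes that Nutch matches the plugin.includes regular expression against each whole plugin id, and that the plugin's id is simply "headings" as this answer states. The function name `active_plugins`, the short id list, and `FIXED_VALUE` (the question's value with headings moved out of the group) are made up for the illustration and are not part of Nutch.

```python
import re


def active_plugins(plugin_includes, plugin_ids):
    """Return the ids that the plugin.includes regex matches in full."""
    pattern = re.compile(plugin_includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


# Illustrative subset of plugin ids; "headings" carries no "parse-" prefix.
PLUGIN_IDS = ["protocol-http", "parse-html", "parse-metatags",
              "headings", "index-metadata"]

# Value from the question: headings sits inside the parse-(...) group,
# so the regex only matches the nonexistent id "parse-headings".
QUESTION_VALUE = (
    "protocol-http|urlfilter-regex|parse-(html|tika|metatags|headings)"
    "|index-(basic|anchor|metadata)|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)"
)

# Assumed corrected value: headings listed as its own alternative.
FIXED_VALUE = (
    "protocol-http|urlfilter-regex|headings|parse-(html|tika|metatags)"
    "|index-(basic|anchor|metadata)|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)"
)

if __name__ == "__main__":
    print("headings" in active_plugins(QUESTION_VALUE, PLUGIN_IDS))  # False
    print("headings" in active_plugins(FIXED_VALUE, PLUGIN_IDS))     # True
```

With the question's value the headings plugin is never selected, while the other listed plugins still are, which would match the symptom of the crawl running normally but without any heading fields.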

Answer 1 (score: 0)

After working through this myself, I found that the following should work (Apache Nutch 1.9):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|headings|parse-(html|tika|metatags)|...</value>
  </property>
  <property>
    <name>index.parse.md</name>
    <value>h1,h2,h3</value>
  </property>
  <property>
    <name>headings</name>
    <value>h1,h2,h3</value>
  </property>
  <property>
    <name>headings.multivalued</name>
    <value>true</value>
  </property>

The following should be added to the schema.xml file (when using Apache Solr):

<!-- fields for the headings plugin -->
<field name="h1" type="text" stored="true" indexed="true" multiValued="true"/>
<field name="h2" type="text" stored="true" indexed="true" multiValued="true"/>
<field name="h3" type="text" stored="true" indexed="true" multiValued="true"/>
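
The configuration in this answer lists the same names (h1, h2, h3) in both headings and index.parse.md. A small Python sketch can check a nutch-site.xml for that consistency. It assumes this answer's convention that the heading names appear verbatim in index.parse.md and that both values are comma-separated; the helper names `read_properties`, `split_list` and `unindexed_headings`, and the `SAMPLE` document, are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Map each <property> name to its value in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    return {
        (prop.findtext("name") or "").strip(): (prop.findtext("value") or "").strip()
        for prop in root.iter("property")
    }


def split_list(value):
    """Split a comma-separated value, dropping whitespace and empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def unindexed_headings(props):
    """Return headings that are extracted but not listed in index.parse.md."""
    indexed = set(split_list(props.get("index.parse.md", "")))
    return [h for h in split_list(props.get("headings", "")) if h not in indexed]


# Made-up sample in which h3 is extracted but never indexed.
SAMPLE = """<configuration>
  <property><name>index.parse.md</name><value>h1,h2</value></property>
  <property><name>headings</name><value>h1,h2,h3</value></property>
</configuration>"""

if __name__ == "__main__":
    print(unindexed_headings(read_properties(SAMPLE)))  # ['h3']
```

Because the split is on commas only, a value such as `h1;h2` is treated as a single entry named "h1;h2", so a semicolon-separated list would also show up as a mismatch in this check.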