How do I configure Nutch and Solr to search video files?

Time: 2013-07-23 06:44:23

Tags: video solr indexing nutch

I have installed Nutch 1.7 and Solr 3.6.2 and am able to crawl and index xls, doc, pdf & zip files. Now I would like to index video files such as .avi and .mov.

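I edited regex-urlfilter.txt to remove these extension types from the skip rule, but the only video files that get indexed are .flv files. I know that this is what Tika says it supports, but I don't need metadata indexing for the video files; I only want the file names indexed.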

How can I enable this?

regex-urlfilter.txt

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|gz|rpm|tgz|exe|jpeg|JPEG|bmp|BMP)$
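
As a quick sanity check of the rule above, here is a minimal Python sketch that applies the same suffix pattern to a few URLs. The URLs are made up for illustration, and Python's `re` module stands in for the Java regex engine used by the urlfilter-regex plugin; for a pattern this simple the two should agree. It only covers this one rule, not any other lines in regex-urlfilter.txt.

```python
import re

# The suffix rule from regex-urlfilter.txt, without the leading '-'
# (in that file, '-' marks a pattern whose matches are rejected).
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|gz|rpm|tgz|exe|jpeg|JPEG|bmp|BMP)$"
)


def rejected_by_suffix_rule(url: str) -> bool:
    """Return True if this single rule would reject the URL."""
    return SKIP_SUFFIXES.search(url) is not None


# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/media/clip.avi",
    "http://example.com/media/clip.mov",
    "http://example.com/media/clip.flv",
    "http://example.com/images/logo.gif",
]

for url in urls:
    verdict = "rejected" if rejected_by_suffix_rule(url) else "passes this rule"
    print(f"{url}: {verdict}")
```

If the video URLs pass this rule, as they appear to here, the missing .avi and .mov documents are more likely dropped at a later stage, such as parsing, than by this filter line.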

nutch-site.xml

<configuration>

<property>
     <name>http.agent.name</name>
      <value>crawler</value>
</property>

<property>
      <name>http.robots.agents</name>
      <value>crawler,*</value>
</property>

<property>
      <name>http.accept.language</name>
      <value>zh-cn, ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
      <description>Value of the "Accept-Language" request header field.
      This allows selecting a non-English language as the default one to retrieve.
      It is a useful setting for search engines built for certain national groups.
      </description>
</property>

<property>
      <name>parser.character.encoding.default</name>
      <value>utf-8</value>
      <description>The character encoding to fall back to when no other information
      is available</description>
</property>

<property>
      <name>http.content.limit</name>
      <value>10000000</value>
      <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
      </description>
</property>

<property>
      <name>file.content.limit</name>
      <value>10000000</value>
      <description>The length limit for downloaded content, in bytes.
       If this value is nonnegative (>=0), content longer than it will be truncated;
       otherwise, no truncation at all.
      </description>
</property>

<property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|zip)|index-(basic|anchor|metadata)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property> 

<property>
      <name>metatags.names</name>
      <value>*</value>
      <description> Names of the metatags to extract, separated by ';'.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
      </description>
</property>

<property>
      <name>index.parse.md</name>
      <value>metatag.description,metatag.keywords</value>
      <description> Comma-separated list of keys to be taken from the parse metadata to generate fields.  Can be used e.g. for 'description' or 'keywords' provided that these values are generated by a parser (see parse-metatags plugin)
      </description>
</property>

</configuration>

0 answers:

No answers