Getting the text of an HTML file from the Tika parser in Solr

Time: 2017-02-24 19:27:14

Tags: html solr apache-tika

I am trying to index HTML files automatically with Solr 6. The solrconfig.xml file looks like this:

<requestHandler name="/update/extract"
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
  <str name="lowernames">true</str>
  <str name="fmap.meta">ignored_</str>
  <str name="fmap.content">_text_</str>
</lst>
</requestHandler>

This is the default configuration. I don't understand how Tika produces the content field that fmap.content refers to.

For example, the command

./bin/post -c myexample -params "extractOnly=true&wt=ruby&indent=yes" -out yes docs/SYSTEM_REQUIREMENTS.html

produces the following output:

{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>12},
  ''=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta
name="stream_size" content="869"/>
<meta name="X-Parsed-By"
            content="org.apache.tika.parser.DefaultParser"/>
<meta
name="X-Parsed-By"
            content="org.apache.tika.parser.html.HtmlParser"/>
<meta
name="stream_content_type" content="text/html"/>
<meta name="dc:title"
            content="System Requirements"/>
<meta
name="Content-Encoding" content="UTF-8"/>
<meta name="Content-Type-Hint"
            content="text/html; charset=UTF-8"/>
<meta
name="resourceName"
            content="/home/szr163/search441/indexer/solr-6.3.0/docs/SYSTEM_REQUIREMENTS.html"/>
<meta
name="Content-Type"
                content="text/html; charset=UTF-8"/>
<title>System Requirements</title>
</head>
<body>
<h1>System Requirements</h1>

<p>Apache Solr runs on Java 8 or greater.</p>

<p>It is also recommended to always use the latest update version of your Java VM, because bugs may affect Solr. An overview of known JVM bugs can be found on <a
                shape="rect" href="http://wiki.apache.org/lucene-java/JavaBugs">http://wiki.apache.org/lucene-java/JavaBugs</a>
</p>

<p>With all Java versions it is strongly recommended to not use experimental <code>-XX</code> JVM options.</p>

<p>CPU, disk and memory requirements are based on the many choices made in implementing Solr (document size, number of documents, and number of hits retrieved to name a few). The benchmarks page has some information related to performance on particular platforms. </p>

</body>
</html>
',
  'null_metadata'=>[
    'stream_size',['869'],
    'X-Parsed-By',['org.apache.tika.parser.DefaultParser',
      'org.apache.tika.parser.html.HtmlParser'],
    'stream_content_type',['text/html'],
    'dc:title',['System Requirements'],
    'Content-Encoding',['UTF-8'],
    'Content-Type-Hint',['text/html; charset=UTF-8'],
    'resourceName',['/home/szr163/search441/indexer/solr-6.3.0/docs/SYSTEM_REQUIREMENTS.html'],
    'title',['System Requirements'],
    'Content-Type',['text/html; charset=UTF-8']]}

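Will <meta name="stream_size" > be interpreted by Solr as a tag for the field stream_size, with the tag's content attribute taken as the value? And why is the text of the HTML not inside any such tag?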

1 Answer:

Answer 0: (score: 0)

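The example techproducts configset includes

"a copyField directive that causes all content to be indexed in a predefined 'catch-all' text field, to enable single-field search that includes all fields' content."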

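Perhaps you can compare your configuration with the techproducts one to understand it better. Otherwise you will need to show more of your configuration.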

https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode

Yes, you would apparently get a Solr field named stream_size with the value 869. But as you know, you have extractOnly=true set, which only parses the file and does not index it.
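To make the split between metadata and body text concrete, here is a minimal Python sketch. It is not Solr's code, only a toy model built on my understanding of Solr Cell: the <meta> elements in the extractOnly output mirror Tika's metadata (the same values appear under null_metadata), while the text inside <body> is what ends up in the field named content, which fmap.content then renames to _text_. The function names split_tika_xhtml and apply_fmap are made up for this illustration, SAMPLE is a shortened copy of the response from the question, and the lowernames step is left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XHTML that Tika emits.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Shortened copy of the XHTML returned by extractOnly=true in the question.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="stream_size" content="869"/>
<meta name="dc:title" content="System Requirements"/>
<title>System Requirements</title>
</head>
<body>
<h1>System Requirements</h1>
<p>Apache Solr runs on Java 8 or greater.</p>
</body>
</html>
"""


def split_tika_xhtml(xhtml):
    """Return (metadata, content) from a Tika XHTML document (hypothetical helper)."""
    root = ET.fromstring(xhtml.encode("utf-8"))

    # Each <meta name=... content=...> pair becomes one metadata value.
    metadata = {}
    for meta in root.iter(XHTML_NS + "meta"):
        metadata.setdefault(meta.get("name"), []).append(meta.get("content"))

    # The body text is not wrapped in a <meta> tag: it is the document content.
    body = root.find(XHTML_NS + "body")
    content = " ".join("".join(body.itertext()).split())
    return metadata, content


def apply_fmap(fields, fmap):
    """Rename fields the way an fmap.<source>=<target> default would (toy model)."""
    return {fmap.get(name, name): value for name, value in fields.items()}


if __name__ == "__main__":
    metadata, content = split_tika_xhtml(SAMPLE)
    fields = apply_fmap({**metadata, "content": [content]}, {"content": "_text_"})
    for name, value in fields.items():
        print(name, "=>", value)
```

In this model stream_size comes out as a metadata field with the value 869, and the visible page text lands in _text_, which is why it does not show up inside any <meta> tag. With extractOnly=true none of this is written to the index.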