I am trying to use Solr 6 to index HTML files automatically. The solrconfig.xml file looks like this:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>
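This is the default configuration. I don't understand how Tika produces the fmap.content field.
For example, the command ./bin/post -c myexample -params "extractOnly=true&wt=ruby&indent=yes" -out yes docs/SYSTEM_REQUIREMENTS.html
produces the following output: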
{
'responseHeader'=>{
'status'=>0,
'QTime'=>12},
''=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta
name="stream_size" content="869"/>
<meta name="X-Parsed-By"
content="org.apache.tika.parser.DefaultParser"/>
<meta
name="X-Parsed-By"
content="org.apache.tika.parser.html.HtmlParser"/>
<meta
name="stream_content_type" content="text/html"/>
<meta name="dc:title"
content="System Requirements"/>
<meta
name="Content-Encoding" content="UTF-8"/>
<meta name="Content-Type-Hint"
content="text/html; charset=UTF-8"/>
<meta
name="resourceName"
content="/home/szr163/search441/indexer/solr-6.3.0/docs/SYSTEM_REQUIREMENTS.html"/>
<meta
name="Content-Type"
content="text/html; charset=UTF-8"/>
<title>System Requirements</title>
</head>
<body>
<h1>System Requirements</h1>
<p>Apache Solr runs on Java 8 or greater.</p>
<p>It is also recommended to always use the latest update version of your Java VM, because bugs may affect Solr. An overview of known JVM bugs can be found on <a
shape="rect" href="http://wiki.apache.org/lucene-java/JavaBugs">http://wiki.apache.org/lucene-java/JavaBugs</a>
</p>
<p>With all Java versions it is strongly recommended to not use experimental <code>-XX</code> JVM options.</p>
<p>CPU, disk and memory requirements are based on the many choices made in implementing Solr (document size, number of documents, and number of hits retrieved to name a few). The benchmarks page has some information related to performance on particular platforms. </p>
</body>
</html>
',
'null_metadata'=>[
'stream_size',['869'],
'X-Parsed-By',['org.apache.tika.parser.DefaultParser',
'org.apache.tika.parser.html.HtmlParser'],
'stream_content_type',['text/html'],
'dc:title',['System Requirements'],
'Content-Encoding',['UTF-8'],
'Content-Type-Hint',['text/html; charset=UTF-8'],
'resourceName',['/home/szr163/search441/indexer/solr-6.3.0/docs/SYSTEM_REQUIREMENTS.html'],
'title',['System Requirements'],
'Content-Type',['text/html; charset=UTF-8']]}
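Will <meta name="stream_size" > be interpreted by Solr as the tag for a field named stream_size, with the tag's content attribute taken as the value? And why is the text of the HTML document not inside any such tag?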
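Answer 0 (score: 0)
The example techproducts configset includes "a copyField directive that causes all content to be indexed in a predefined 'catch-all' text field, to enable single-field search that includes all fields' content."
Perhaps you can compare your configuration with the techproducts one to understand it better. Otherwise you will need to show more of your configuration.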
https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode
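As a rough illustration of the copyField directive quoted above, a catch-all setup in a schema file might look like the fragment below. This is a sketch, not a copy of any particular configset: the field name _text_ is taken from the fmap.content setting in the question, the text_general field type is assumed to be defined in your schema, and the attribute values are assumptions you should check against your own managed-schema.

```xml
<!-- Catch-all field (assumed name: _text_, matching fmap.content in the question).
     indexed but not stored: searchable, yet not returned in results. -->
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- Copy the content of every field into the catch-all field at index time,
     so that a query against _text_ searches all fields at once. -->
<copyField source="*" dest="_text_"/>
```

With such a directive in place, a single-field query against _text_ would match text that originally arrived in any other field.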
Yes, you would apparently get a Solr field named stream_size with the value 869. But note that you passed extractOnly=true, which only parses the file and does not index it.
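To tie this back to the handler in the question, here is the same configuration with comments describing what each default does, as far as I understand the extracting request handler. The settings are unchanged from the question; only the comments are added. Whether the metadata fields actually end up in the index depends on your schema, which the question does not show.

```xml
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Lowercase field names and replace non-alphanumeric characters with
         underscores, e.g. Content-Type becomes content_type. -->
    <str name="lowernames">true</str>
    <!-- fmap.SOURCE=TARGET renames a field: a field named meta becomes ignored_. -->
    <str name="fmap.meta">ignored_</str>
    <!-- The body text extracted by Tika is collected in a field named content;
         this renames it to _text_. That is why the body text is not inside
         any meta tag in the extractOnly output. -->
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>
```

In the extractOnly output shown in the question, the XHTML under the '' key is what the body text is taken from, while the entries under 'null_metadata' are the candidates for separate fields such as stream_size.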