Question

I need to index only the plain text in an HTML document and discard all of the HTML tags.

For example, I have HTML like this:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>
       title
    </title>
    <link href="./test.html" rel="StyleSheet" type="text/css" />
    </head>
    <body>
      <h1 style="height: 22px">
       header
      </h1>
    </body>
</html>

I want only the "header" text under the body tag to be indexed, with all other HTML markup kept out of Solr's _text_ field.

I tried <charFilter class="solr.HTMLStripCharFilterFactory"/>, as shown below:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

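But it still indexes the HTML tag attributes.

According to the Solr documentation, solr.HTMLStripCharFilterFactory should keep HTML tags from being indexed.

When I search solr/testcore/select?q=_text_:height&wt=json, it returns a record, which it should not.

I tried both solr-5.3.1 and solr-6.6.0.

I'm stuck on this, please help.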

Answer 1

Since you are posting the raw HTML content to Solr, the extracting request handler ("Solr Cell") is processing it. That handler uses Apache Tika to extract content from the HTML file.

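This means the _text_ field never sees the HTML tags: Apache Tika has already extracted the content and the tags are gone, so there is nothing left for the char filter to strip.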

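If you use a Solr client in your programming language of choice and submit the HTML directly as a field value, HTML stripping will work as you expect, because the tags are then actually part of the content submitted to the field for indexing inside Solr.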
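As a rough illustration of that route, the Python sketch below sends the raw HTML as an ordinary field value through Solr's JSON update endpoint, so that the field's analyzer chain (including HTMLStripCharFilterFactory) gets to see the markup. The host and port, the document id, and the choice of writing straight into _text_ are assumptions made for the example; only the core name testcore comes from the question.

```python
import json
import urllib.request

# Assumed: a local Solr on the default port; "testcore" is the core from the question.
SOLR_UPDATE_URL = "http://localhost:8983/solr/testcore/update?commit=true"


def build_update_request(doc_id, html, field="_text_", url=SOLR_UPDATE_URL):
    """Build a JSON update request that carries the raw HTML as a field value."""
    # The update handler accepts a JSON array of documents.
    payload = json.dumps([{"id": doc_id, field: html}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_html(doc_id, html):
    """Send the document to Solr. Requires a running Solr instance."""
    with urllib.request.urlopen(build_update_request(doc_id, html)) as resp:
        return resp.status
```

Calling index_html("doc1", html) against a running core should then leave tag removal to the char filter at index time. Building the request in a separate function keeps the payload easy to inspect without a live server.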

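I tried to find some way of configuring the HTML Parser in the bundled Tika version (it uses the Tagsoup library for parsing), but could not see any exposed configuration that would change the behavior you are seeing.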
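If the markup has to be gone before the document ever reaches Solr, one workaround that is not part of the answer above is to extract the body text on the client side. The sketch below uses only Python's standard html.parser module. The class and function names are hypothetical, and it is a simplification rather than a replacement for Tika or for Solr's char filter (for instance, it does not skip script or style content inside the body).

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text nodes inside <body>, ignoring tags and their attributes."""

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as style="height: 22px" are never stored.
        if tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        # Keep only non-blank text that appears inside the body element.
        if self._in_body and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def body_text(html):
    """Return the plain text found under <body> in the given HTML string."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The returned string can then be submitted as a plain field value, so neither the title text nor attribute values such as height end up in the index.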
