I index some HTML documents with Solr 6.6.0. The content field contains a lot of link text, which dilutes the search results. How can I remove the tag content from the "content" field before it is indexed and stored in Solr? Can this be done with an updateRequestProcessorChain? Does anybody know a solution?
Answer 0 (score: 0)
At index time, use HTMLStripCharFilterFactory as a char filter in the field definition.
This char filter removes HTML from the input stream:
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer ...>
[...]
</analyzer>
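For context, a minimal sketch of how the analyzer above might sit inside a complete field type in the schema. The field type name `text_html_stripped`, the tokenizer, and the lowercase filter are illustrative assumptions, not part of the original answer. Note that a char filter only affects the indexed tokens: the stored value of the field is unchanged, and the text between tags (such as link text) is generally kept while the markup itself is removed.

```xml
<!-- Hypothetical field type; the name, tokenizer and filter are assumptions for illustration -->
<fieldType name="text_html_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Strips HTML markup from the character stream before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The "content" field from the question, bound to the field type above -->
<field name="content" type="text_html_stripped" indexed="true" stored="true"/>
```

If the stored value also needs to be free of markup, the stripping has to happen before analysis, for example in an update request processor chain.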
Answer 1 (score: 0)
I solved the problem by adding hidden divs before and after the text:
<div style="display:none">1%%A</div>
TEXT TEXT TEXT
<div style="display:none">1%%E</div>
and adding this to solrconfig.xml:
<updateRequestProcessorChain name="myregex">
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">mytextfield</str>
<str name="pattern">([1]{1}%{2}[A]{1})(.*)([1]{1}%{2}[E]{1})</str>
<str name="replacement"> </str>
<bool name="literalReplacement">true</bool>
</processor>
</updateRequestProcessorChain>
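A possible variant of the chain above, offered as a sketch rather than a tested configuration. It assumes the marked text can span several lines and that more than one marked region may occur in a field, so the pattern uses `(?s)` to let `.` match newlines and a lazy `.*?` to stop at the first end marker. It also ends the chain with `RunUpdateProcessorFactory`, which a chain defined in solrconfig.xml normally needs in order for the update to actually reach the index.

```xml
<!-- Sketch only: chain name and field name are taken from the answer above -->
<updateRequestProcessorChain name="myregex">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">mytextfield</str>
    <!-- (?s) lets . match line breaks; .*? stops at the nearest 1%%E marker -->
    <str name="pattern">(?s)1%%A.*?1%%E</str>
    <str name="replacement"> </str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <!-- Logs the update and then executes it against the index -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain is only applied when it is selected, for example by passing `update.chain=myregex` as a request parameter on the update request.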
It works for me.