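I'm using Solr 6.6.0 (the core was created from the "sample" configuration). When I import rich documents (HTML in this case) with the ExtractingRequestHandler, unwanted line-break codes (\n) and tabs (\t) end up in the index. I tried setting up MappingCharFilterFactory and similar filters, but none of them had any effect. I also consulted several online references, but nothing helped.
How can I prevent tabs and line-break codes (\n, \r\n, \t) from being indexed?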
[Steps I took]
Fill in the form.
Select "test.html" in the form above and execute it.
[Response]
Status: success
Response:
{
"responseHeader": {
"status": 0,
"QTime": 618
}
}
[QueryResults]
{
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"q":"*:*",
"indent":"on",
"wt":"json",
"_":"1498437444505"}},
"response":{"numFound":1,"start":0,"docs":[
{
"size_d":20.0,
"content_type_s":"text/html",
"filename_txt_ja":"test.html",
"content_txt_ja":" \n \n \n \n \n \n \n \n \n \n \n \n AAA\n\tBBB\n\tCCC\nDDD ",
"id":"5a311ac9-77fe-46a6-8524-4ab40c8ece4a",
"_version_":1571244755499614208}]
}
}
I want to get rid of these "\n" and "\t" characters in the content_txt_ja field.
Here are my configuration XML files.
[solrconfig.xml]
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.a">ignored_</str>
<str name="fmap.div">ignored_</str>
<str name="fmap.a">ignored_</str>
<str name="fmap.stream_content_type">content_type_s</str>
<str name="fmap.content">content_txt_ja</str>
<str name="fmap.body">content_txt2_ja</str>
<str name="fmap.stream_name">filename_txt_ja</str>
<str name="fmap.author">author_txt_ja</str>
<str name="fmap.last_author">last_author_txt_ja</str>
<str name="fmap.creation_date">creation_dt</str>
<str name="fmap.last_modified">modified_dt</str>
<str name="fmap.stream_size">size_d</str>
</lst>
</requestHandler>
[managed-schema]
<dynamicField name="*_txt_ja" type="text_ja" indexed="true" stored="true"/>
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\\n)" replacement=""/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\n" replacement=""/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\\n]" replacement=""/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\n]" replacement=""/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\t" replacement=""/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\t" replacement=""/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u000a" replacement=" AAA " />
<charFilter class="solr.PatternReplaceFilterFactory" pattern="\u000A" replacement="," />
<charFilter class="solr.PatternReplaceFilterFactory" pattern="\u000D" replacement=";" />
<charFilter class="solr.PatternReplaceFilterFactory" pattern="\u000D\u000A" replacement="." />
<charFilter class="solr.PatternReplaceFilterFactory" pattern="\\u000A" replacement="," />
<charFilter class="solr.PatternReplaceFilterFactory" pattern="\\u000D" replacement=";" />
<charFilter class="solr.PatternReplaceFilterFactory" pattern="\\u000D\\u000A" replacement="." />
<charFilter class="solr.PatternReplaceFilterFactory" pattern="(\\u000A)" replacement="," />
<charFilter class="solr.PatternReplaceFilterFactory" pattern="(\\u000D)" replacement=";" />
<charFilter class="solr.PatternReplaceFilterFactory" pattern="(\\u000D\\u000A)" replacement="." />
<!--<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>-->
<!--<tokenizer class="solr.JapaneseTokenizerFactory" mode="normal"/>-->
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
<filter class="solr.TrimFilterFactory" />
<!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
<filter class="solr.JapaneseBaseFormFilterFactory"/>
<!-- Removes tokens with certain part-of-speech tags -->
<filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
<!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
<filter class="solr.CJKWidthFilterFactory"/>
<!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" />
<!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
<filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
<!-- Lower-cases romaji characters -->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
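Answer 0 (score: 2)
Indexing and storing are two different things. Put simply:
- indexed content is used to perform searches
- stored content is what is returned in search results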
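As a rough illustration of that distinction, the two flags are set independently on each field in the schema. The field names below are hypothetical and are not part of the question's schema, and the `string` type is assumed to be defined the way it is in the stock configsets.

```xml
<!-- Hypothetical field: searchable, but its value is never returned in results -->
<field name="body_search" type="text_ja" indexed="true" stored="false"/>

<!-- Hypothetical field: returned verbatim in results, but not searchable -->
<field name="body_display" type="string" indexed="false" stored="true"/>
```

The analyzer of a field type only shapes what goes into the index. The stored value is kept exactly as it was submitted, which is why char filters do not change what comes back in `docs`.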
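You can remove these special characters from your indexed content with an analysis chain, as you are doing (I haven't tested your filters, but they are probably fine). Removing them from the stored content, which is what the response returns, is a different matter. You need to clean the content before it reaches Solr, or do it at update-request-processor time, possibly with a custom Solr plugin.
If you only want to keep them out of your API response, you can also clean the Solr response in an intermediate API layer and return the clean content to the client.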
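For the index side, a trimmed-down analyzer might look like the sketch below. The type name `text_ja_clean` is made up for illustration, and the filter list is shortened on purpose. One thing worth checking in the posted schema: `solr.PatternReplaceFilterFactory` is a token filter, so inside a `<charFilter>` element the class would need to be `solr.PatternReplaceCharFilterFactory`.

```xml
<!-- Sketch only: "text_ja_clean" is a hypothetical type name -->
<fieldType name="text_ja_clean" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- One char filter: replace each run of newline, carriage return or tab with a single space -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\n\r\t]+" replacement=" "/>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Replacing with a space rather than an empty string keeps "AAA" and "BBB" from being glued together before tokenization. This still affects only the indexed tokens, not the stored value.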
Answer 1 (score: 0)
Thanks to Alessandro Benedetti and k.se1. Following k.se1's suggestion, add a "RegexReplaceProcessorFactory" configuration to an "updateRequestProcessorChain" in solrconfig.xml to filter out \n, \t, or anything else you need to replace.
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="update.chain">extract</str>
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
<str name="capture">h1</str>
<str name="fmap.h1">h1_content</str>
</lst>
</requestHandler>
<updateRequestProcessorChain name="extract">
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">h1_content</str>
<str name="pattern">\n</str>
<str name="replacement"></str>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">h1_content</str>
<str name="pattern">\t</str>
<str name="replacement"></str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
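The chain above cleans `h1_content`. Adapting it to the `content_txt_ja` field from the question could look roughly like this. It is a sketch that has not been run against 6.6.0; the chain name `strip-whitespace` is an assumption, and the request handler would need `<str name="update.chain">strip-whitespace</str>` in its defaults to pick it up.

```xml
<!-- Sketch only: the chain name "strip-whitespace" is hypothetical -->
<updateRequestProcessorChain name="strip-whitespace">
  <!-- Collapse every run of whitespace (spaces, \n, \r, \t) into one space -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content_txt_ja</str>
    <str name="pattern">\s+</str>
    <str name="replacement"> </str>
  </processor>
  <!-- Remove the single leading and trailing space left by the step above -->
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldName">content_txt_ja</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

Because the processors run before the document is written, both the stored and the indexed value are affected. With the sample document from the question, the stored field should come out as something like "AAA BBB CCC DDD".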