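How to remove \n or \t codes in Solr

Time: 2017-06-26 06:24:33

Tags: solr

I am using Solr 6.6.0 (the core was created with the name "sample"). When I import rich documents (HTML in this case) with the ExtractingRequestHandler, unnecessary newline codes (\n) and tabs (\t) get indexed. I tried configuring MappingCharFilterFactory and similar filters, but it had no effect. I also looked at a related page, but that did not help either.

How can I prevent tabs and newline codes (\n, \r\n, \t) from being indexed?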

[Steps I took]

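  1. Open "http://localhost:8983/solr/#/sample/documents".
  2. Select my core (sample), then click the "Documents" link in the left-hand menu.
  3. Fill in the form:

    • Request-Handler: "/update/extract"
    • Document Type: File Upload
    • Document(s): test.html
    • Extracting Req. Handler Params: not specified
    • Commit Within: 1000
    • Overwrite: true
  4. Select "test.html" as above and submit it.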

  5. [Response]

    Status: success
    Response:
    {
      "responseHeader": {
        "status": 0,
        "QTime": 618
      }
    }
    

    [QueryResults]

    {
      "responseHeader":{
        "status":0,
        "QTime":0,
        "params":{
          "q":"*:*",
          "indent":"on",
          "wt":"json",
          "_":"1498437444505"}},
      "response":{"numFound":1,"start":0,"docs":[
          {
            "size_d":20.0,
            "content_type_s":"text/html",
            "filename_txt_ja":"test.html",
            "content_txt_ja":" \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n \n AAA\n\tBBB\n\tCCC\nDDD  ",
            "id":"5a311ac9-77fe-46a6-8524-4ab40c8ece4a",
            "_version_":1571244755499614208}]
      }
    }
    

    I want to remove these "\n" and "\t" characters from the content_txt_ja field.

    Here are my configuration XML files.

    [solrconfig.xml]

    <requestHandler name="/update/extract" 
                      startup="lazy"
                      class="solr.extraction.ExtractingRequestHandler" >
        <lst name="defaults">
          <str name="lowernames">true</str>
          <str name="uprefix">ignored_</str>
    
          <!-- capture link hrefs but ignore div attributes -->
          <str name="captureAttr">true</str>
          <str name="fmap.meta">ignored_</str>
          <str name="fmap.a">ignored_</str>
          <str name="fmap.div">ignored_</str>
          <str name="fmap.a">ignored_</str>
    
          <str name="fmap.stream_content_type">content_type_s</str>
          <str name="fmap.content">content_txt_ja</str>
          <str name="fmap.body">content_txt2_ja</str>
          <str name="fmap.stream_name">filename_txt_ja</str>
    
          <str name="fmap.author">author_txt_ja</str>
          <str name="fmap.last_author">last_author_txt_ja</str>
    
          <str name="fmap.creation_date">creation_dt</str>
          <str name="fmap.last_modified">modified_dt</str>
          <str name="fmap.stream_size">size_d</str>
    
        </lst>
      </requestHandler>
    

    [managed-schema.xml]

    <dynamicField name="*_txt_ja" type="text_ja"  indexed="true"  stored="true"/>
        <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
          <analyzer>
            <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\\n)" replacement=""/>
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\n" replacement=""/>
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\\n]" replacement=""/>
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\n]" replacement=""/>
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\t" replacement=""/>
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\t" replacement=""/>
    
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u000a" replacement=" AAA " />
    
            <charFilter class="solr.PatternReplaceFilterFactory" pattern="\u000A" replacement="," />
            <charFilter class="solr.PatternReplaceFilterFactory" pattern="\u000D" replacement=";" />
            <charFilter class="solr.PatternReplaceFilterFactory" pattern="\u000D\u000A" replacement="." />
    
    
            <charFilter class="solr.PatternReplaceFilterFactory" pattern="\\u000A" replacement="," />
            <charFilter class="solr.PatternReplaceFilterFactory" pattern="\\u000D" replacement=";" />
            <charFilter class="solr.PatternReplaceFilterFactory" pattern="\\u000D\\u000A" replacement="." />
    
            <charFilter class="solr.PatternReplaceFilterFactory" pattern="(\\u000A)" replacement="," />
            <charFilter class="solr.PatternReplaceFilterFactory" pattern="(\\u000D)" replacement=";" />
            <charFilter class="solr.PatternReplaceFilterFactory" pattern="(\\u000D\\u000A)" replacement="." />
    
    
            <!--<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>--> 
            <!--<tokenizer class="solr.JapaneseTokenizerFactory" mode="normal"/>-->
            <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
    
            <filter class="solr.TrimFilterFactory" />
    
            <!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
            <filter class="solr.JapaneseBaseFormFilterFactory"/>
            <!-- Removes tokens with certain part-of-speech tags -->
            <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
            <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
            <filter class="solr.CJKWidthFilterFactory"/>
            <!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" />
            <!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
            <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
            <!-- Lower-cases romaji characters -->
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
        </fieldType>
    

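2 Answers:

Answer 0 (score: 2)

Indexing and storing are two different things. Simply put:

  - indexed content is used to perform searches
  - stored content is what is returned in search results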

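You can remove these special characters from the indexed content with the analysis chain, as you are doing (I have not tested your filters, but they are probably fine). Removing them from the stored content (what comes back in the response) is a different matter. You need to clean the content before it reaches Solr, or do it at the update request processor stage with a custom Solr plugin.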

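If you simply do not want these characters to reach your API responses, you can clean the Solr response in an intermediate API layer and return the cleaned content to the client.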
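
The middle-layer cleanup described above might look like the following minimal Python sketch. It assumes the Solr response has already been parsed from JSON into a dict. The names `clean_text` and `clean_solr_response`, the `fields` parameter, and the choice to collapse each whitespace run into a single space are illustrative assumptions, not part of Solr.

```python
import re

# Runs of whitespace, including \n, \r and \t, become a single space.
_WHITESPACE_RUN = re.compile(r"\s+")


def clean_text(value):
    """Collapse whitespace runs in a string; return other types unchanged."""
    if isinstance(value, str):
        return _WHITESPACE_RUN.sub(" ", value).strip()
    return value


def clean_solr_response(response, fields):
    """Return a copy of a parsed Solr JSON response with the given fields cleaned.

    `fields` is the set of stored field names to clean (hypothetical parameter).
    Multi-valued fields (lists) are cleaned element by element.
    """
    cleaned_docs = []
    for doc in response["response"]["docs"]:
        new_doc = dict(doc)
        for name in fields:
            if name in new_doc:
                value = new_doc[name]
                if isinstance(value, list):
                    new_doc[name] = [clean_text(v) for v in value]
                else:
                    new_doc[name] = clean_text(value)
        cleaned_docs.append(new_doc)
    result = dict(response)
    result["response"] = dict(response["response"], docs=cleaned_docs)
    return result


# Example using (a shortened form of) the document from the question.
raw = {
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [
            {
                "id": "5a311ac9-77fe-46a6-8524-4ab40c8ece4a",
                "content_txt_ja": " \n \n  \n AAA\n\tBBB\n\tCCC\nDDD  ",
            }
        ],
    }
}
cleaned = clean_solr_response(raw, {"content_txt_ja"})
print(cleaned["response"]["docs"][0]["content_txt_ja"])  # AAA BBB CCC DDD
```

This leaves the stored value in Solr untouched and only changes what the client sees, so it can be adjusted later without reindexing.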

Answer 1 (score: 0)

Thanks to Alessandro Benedetti and k.se1. As k.se1 suggested, add a "RegexReplaceProcessorFactory" configuration to an "updateRequestProcessorChain" in solrconfig.xml to filter out \n, \t, or whatever else you want to replace.

<requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="update.chain">extract</str>
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
      <str name="capture">h1</str>
      <str name="fmap.h1">h1_content</str>
    </lst>
</requestHandler>

<updateRequestProcessorChain name="extract">
    <processor class="solr.RegexReplaceProcessorFactory"> 
        <str name="fieldName">h1_content</str> 
        <str name="pattern">\n</str> 
        <str name="replacement"></str> 
    </processor> 
    <processor class="solr.RegexReplaceProcessorFactory"> 
        <str name="fieldName">h1_content</str> 
        <str name="pattern">\t</str> 
        <str name="replacement"></str> 
    </processor> 
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>
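
To preview what the two regex replacements above do to a field value, the same patterns can be tried outside Solr. The sketch below uses Python's `re` module as a stand-in for the Java regex engine that Solr uses. For simple patterns such as `\n` and `\t` the two should behave alike, but that equivalence is an assumption of this sketch, and the helper `apply_replacements` is hypothetical. It also shows that an empty replacement joins words that were separated only by a newline or tab, so replacing with a single space may be preferable.

```python
import re

# Sample value taken from the question's query result (shortened).
value = " \n \n AAA\n\tBBB\n\tCCC\nDDD  "


def apply_replacements(text, rules):
    """Apply (pattern, replacement) pairs in order, like a processor chain."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


# Mirrors the chain above: remove newlines, then remove tabs.
removed = apply_replacements(value, [(r"\n", ""), (r"\t", "")])

# Alternative (an assumption, not from the answer): replace each run of
# newlines, carriage returns and tabs with one space to keep words apart.
spaced = apply_replacements(value, [(r"[\n\r\t]+", " ")])

print(repr(removed))
print(repr(spaced))
```

With the empty replacement, "AAA", "BBB", "CCC" and "DDD" end up fused into one run of letters, whereas the space replacement keeps them as separate words.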