Apache Solr search returns too many results

Time: 2016-07-11 07:52:13

Tags: solr lucene

I am using Apache Solr 5.1.

The Solr index holds more than 13,000 documents, and I use Apache Tika to index PDF documents.

To improve search relevance I am using the edismax parser. It works well and I get the expected results.

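However, a single-word query that should match only 3 documents returns more than 400 results instead. The 3 expected results are at the top, and the rest are irrelevant.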

Here is the field type I use for almost all fields in schema.xml:
 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true" omitNorms="true">
  <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>

        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.KeywordRepeatFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>

        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />

        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />

        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"     generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"     splitOnCaseChange="1" />

  </analyzer>
  <analyzer type="query">      
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.KeywordRepeatFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"     generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"     splitOnCaseChange="1" />
  </analyzer>
</fieldType>

Sample query parameters:

 {
  "responseHeader": {
    "status": 0,
    "QTime": 149,
    "params": {
      "mm": "100%",
      "qs": "10",
      "ps": "10",
      "indent": "true",
      "q.op": "AND",
      "lowercaseOperators": "true",
      "q": "b4u",
      "defType": "edismax",
      "qf": "story_title^5.0  tax_payer_name^3.0  judgement_text^1.0  story_description^1.0  nature_of_the_issues  decision_summary  additional_comments  facts_of_the_case  section_number  case_law_citation",
      "pf": "story_title^5.0  tax_payer_name^3.0  judgement_text^1.0  story_description^1.0  nature_of_the_issues  decision_summary  additional_comments  facts_of_the_case  section_number  case_law_citation",
      "wt": "json",
      "stopwords": "true",
      "_": "1468224236421"
    }
  },

Thanks in advance.

1 answer:

Answer 0: (score: 0)

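I solved this by removing HTMLStripCharFilterFactory. It is supposed to strip HTML characters at index time, but with it in place "b4u" was indexed as "b", "4" and "U", which caused too many results.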
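
To illustrate why that split inflates the result count, here is a minimal Python sketch. It is not Solr code: the toy documents, the two tokenizers, and the `search` helper are all hypothetical. Every query token must be present in a document, which is roughly what `mm=100%` asks for. It only models the effect of the split and says nothing about which analyzer component produces it.

```python
import re

def whole_words(text):
    """Tokenize on whitespace only, so 'b4u' stays one token."""
    return text.lower().split()

def split_letters_digits(text):
    """Tokenize into runs of letters or digits, so 'b4u' becomes b, 4, u."""
    return re.findall(r"[a-z]+|[0-9]+", text.lower())

def search(docs, query, tokenize):
    """Return ids of documents that contain every token of the query."""
    query_tokens = tokenize(query)
    hits = []
    for doc_id, text in docs.items():
        doc_tokens = set(tokenize(text))
        if all(tok in doc_tokens for tok in query_tokens):
            hits.append(doc_id)
    return hits

# Made-up documents, for illustration only.
docs = {
    "doc1": "B4U Television appeal",
    "doc2": "Section 4 clause b form u",
    "doc3": "Schedule B item 4 unit U",
    "doc4": "Income tax appeal",
}

print(search(docs, "b4u", whole_words))           # only the document containing b4u
print(search(docs, "b4u", split_letters_digits))  # also documents with b, 4 and u anywhere
```

With the term kept whole, only the document that actually contains it matches. Once it is split, any document that happens to contain the three short tokens matches too, which fits the pattern of a few relevant hits followed by hundreds of irrelevant ones.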

I now strip the HTML tags at index time with PHP's strip_tags function.
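
For readers who are not indexing from PHP, the same pre-processing step can be sketched in Python with the standard library's `html.parser`. This is an analog of the idea, not the code used above: the `TagStripper` class and this `strip_tags` function are hypothetical names, and their behavior is not guaranteed to match PHP's `strip_tags` in every edge case.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are dropped.
        self.parts.append(data)

def strip_tags(html):
    """Return the text of `html` with all tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.parts)

# Clean the field value before sending the document to Solr.
print(strip_tags("<p>B4U <b>Television</b> appeal</p>"))  # B4U Television appeal
```

Cleaning the text before it reaches Solr keeps the field type's analyzer chain free of HTML handling, which is the approach this answer settled on.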