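I am using Apache Solr 5.1.
The Solr index holds more than 13,000 documents, and I use Apache Tika to index PDF documents.
To improve search relevance I am using the edismax parser. It works well and I get the expected results.
However, a single-word query that should match only 3 documents returns more than 400 results. The 3 expected results are at the top, and the rest are irrelevant.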
This is the field type I use for almost all fields in schema.xml:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true" omitNorms="true">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
</analyzer>
</fieldType>
Sample query parameters:
{
"responseHeader": {
"status": 0,
"QTime": 149,
"params": {
"mm": "100%",
"qs": "10",
"ps": "10",
"indent": "true",
"q.op": "AND",
"lowercaseOperators": "true",
"q": "b4u",
"defType": "edismax",
"qf": "story_title^5.0 tax_payer_name^3.0 judgement_text^1.0 story_description^1.0 nature_of_the_issues decision_summary additional_comments facts_of_the_case section_number case_law_citation",
"pf": "story_title^5.0 tax_payer_name^3.0 judgement_text^1.0 story_description^1.0 nature_of_the_issues decision_summary additional_comments facts_of_the_case section_number case_law_citation",
"wt": "json",
"stopwords": "true",
"_": "1468224236421"
}
},
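For anyone who wants to reproduce the request, here is a minimal sketch that rebuilds the same edismax parameters as a select URL. The host, port, and core name in `SOLR_SELECT_URL` are assumptions (they do not appear in the question), and `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port, and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"

# Field list copied from the qf/pf parameters in the question.
FIELDS = (
    "story_title^5.0 tax_payer_name^3.0 judgement_text^1.0 "
    "story_description^1.0 nature_of_the_issues decision_summary "
    "additional_comments facts_of_the_case section_number case_law_citation"
)


def build_query_url(user_query: str) -> str:
    """Return a select URL carrying the edismax parameters from the question."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": FIELDS,
        "pf": FIELDS,
        "mm": "100%",
        "qs": "10",
        "ps": "10",
        "q.op": "AND",
        "lowercaseOperators": "true",
        "stopwords": "true",
        "wt": "json",
        "indent": "true",
    }
    # urlencode percent-escapes values such as "100%" and the "^" boosts.
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url("b4u"))
```

Appending `debugQuery=true` to such a URL makes Solr return the parsed query, which shows the terms the analyzer actually produced for `b4u`.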
Thanks in advance.
Answer 0 (score: 0)
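I solved this by removing HTMLStripCharFilterFactory. It is supposed to strip HTML characters at index time, but "b4u" was being indexed as "b", "4", and "U", which caused far too many results.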
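To see why that split inflates the result count, here is a toy sketch in Python. It is not Solr's analysis chain: the two analyzers, the matching rule, and the sample documents are all made up for illustration. It may also be worth checking the Analysis screen in the Solr admin UI, since WordDelimiterFilterFactory splits at letter/digit boundaries by default as well (`splitOnNumerics`).

```python
import re

# Made-up sample documents, for illustration only.
DOCS = [
    "B4U Television appeal",
    "Form 4 annexure B unit U",
    "unrelated judgement text",
]


def whole_token(text: str) -> list[str]:
    """Whitespace tokenization plus lowercasing: 'b4u' stays one token."""
    return text.lower().split()


def split_token(text: str) -> list[str]:
    """Additionally split every token at letter/digit boundaries."""
    parts = []
    for token in text.lower().split():
        parts.extend(re.findall(r"[a-z]+|[0-9]+", token))
    return parts


def matches(query: str, doc: str, analyze) -> bool:
    """Toy rule: a document matches when it contains every query token."""
    doc_tokens = set(analyze(doc))
    return all(token in doc_tokens for token in analyze(query))


if __name__ == "__main__":
    for name, analyze in (("whole", whole_token), ("split", split_token)):
        hits = [doc for doc in DOCS if matches("b4u", doc, analyze)]
        print(name, hits)
```

With the whole-token analyzer only the first document matches. Once the term is split, the second document matches too, because it happens to contain "b", "4", and "u" as separate words.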
I now strip HTML tags at index time with PHP's strip_tags function instead.
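For readers who index from Python rather than PHP, a rough equivalent can be sketched with the standard library's `html.parser`. This is an approximation, not a port of PHP's `strip_tags`: unlike the PHP function it also decodes character references such as `&amp;`. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self) -> str:
        return "".join(self._chunks)


def strip_tags(html: str) -> str:
    """Return the input with HTML tags removed, before sending it to Solr."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    print(strip_tags("<p>Case <b>b4u</b> vs. ITO</p>"))
```

Cleaning the text before it reaches Solr means the field type no longer needs a char filter for HTML, so the token "b4u" reaches the tokenizer unchanged.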