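When searching for USC in Apache Solr 3.6 with highlighting enabled, why doesn't it also match U.S.C. and include it in the highlighted results?
The field type is defined as follows: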
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
I want Solr to match U.S.C. as well as USC and to highlight both in the search results.
However, it only returns USC:
<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">7</int><lst name="params"><str name="explainOther"/><str name="fl">*,score</str><str name="indent">on</str><str name="start">0</str><str name="q">USC</str><str name="hl.fl">*</str><str name="wt"/><str name="fq"/><str name="hl">on</str><str name="version">2.2</str><str name="rows">10</str></lst></lst><result name="response" numFound="1" start="0" maxScore="0.047945753"><doc><float name="score">0.047945753</float><str name="id">978-064172344522</str><arr name="title"><str>my <a href="www.foo.bar">link</a> power-shot PowerShot USC Utility <br>hello</br> Rejections Under 35 U.S.C. 101 and 35 U.S.C. 112, First Paragraph Petitions to correct inventorship of an issued patent are decided by the <Underline>Supervisory Patent Examiner</Underline>, as set forth</str></arr></doc></result><lst name="highlighting"><lst name="978-064172344522"><arr name="title"><str>my <a href="www.foo.bar">link</a> power-shot PowerShot <em>USC</em> Utility <br>hello</br> Rejections Under</str></arr></lst></lst></response>
Answer 0 (score: 0)
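If you go to Solr's analysis page and run the string "U.S.C." against the text_en_splitting field type, you will see that it is indexed as three separate tokens: u, s, and c. Experiment with the attributes of WordDelimiterFilterFactory (perhaps the catenateAll attribute) and see whether you can get it indexed as usc (a single token) instead of three split tokens. If that does not work, you may have to extend the tokenizer to handle your case.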
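As a rough sketch of the suggestion above, the index-time WordDelimiterFilterFactory line could be changed as shown below. Only catenateAll differs from the question's configuration; every other attribute is copied unchanged. Whether this actually changes the tokens produced for "U.S.C." is an assumption to verify on the analysis page, since catenateWords="1" is already set in the index analyzer.

```xml
<!-- Sketch only: replaces the WordDelimiterFilterFactory line inside <analyzer type="index">. -->
<!-- Assumption: catenateAll="1" additionally emits the joined token (usc after lowercasing); -->
<!-- confirm the resulting tokens on the Solr analysis page before relying on it. -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
```

Because this changes the index analyzer, existing documents would need to be reindexed before the new tokens can be matched or highlighted.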