Solr parser for abbreviations with periods

Time: 2012-06-15 16:01:10

Tags: solr

In Apache Solr 3.6, when I search for USC with highlighting turned on, why doesn't it also pick up U.S.C. and highlight it in the results?

The field type is defined as follows:

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

I want Solr to return U.S.C. as well as USC, both highlighted in the search results.

However, it only returns USC:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">7</int>
    <lst name="params">
      <str name="explainOther"/>
      <str name="fl">*,score</str>
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">USC</str>
      <str name="hl.fl">*</str>
      <str name="wt"/>
      <str name="fq"/>
      <str name="hl">on</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0" maxScore="0.047945753">
    <doc>
      <float name="score">0.047945753</float>
      <str name="id">978-064172344522</str>
      <arr name="title">
        <str>my <a href="www.foo.bar">link</a>  power-shot PowerShot USC Utility <br>hello</br> Rejections Under 35 U.S.C. 101 and 35 U.S.C. 112, First Paragraph Petitions to correct inventorship of an issued patent are decided by the <Underline>Supervisory Patent Examiner</Underline>, as set forth</str>
      </arr>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="978-064172344522">
      <arr name="title">
        <str>my <a href="www.foo.bar">link</a>  power-shot PowerShot <em>USC</em> Utility <br>hello</br> Rejections Under</str>
      </arr>
    </lst>
  </lst>
</response>

1 answer:

Answer 0: (score: 0)

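If you go to Solr's analysis page and run the string "U.S.C." through the text_en_splitting field type, you will see that it is indexed as three separate tokens: u, s, c. Experiment with the attributes of WordDelimiterFilterFactory (perhaps the catenateAll attribute) and see whether you can get it indexed as usc (a single token) instead of three split tokens. If that does not work, you may have to extend the tokenizer to suit your case.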
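
As a rough sketch of that suggestion, the index-time filter from the question could be changed as below. The only difference from the original definition is catenateAll="1". Whether this actually produces a single usc token for "U.S.C." with this analyzer chain is an assumption, so check the output on the analysis page before relying on it.

```xml
<!-- Goes inside <analyzer type="index"> of text_en_splitting, replacing the
     existing WordDelimiterFilterFactory line. Untested sketch: only
     catenateAll differs from the definition in the question. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="1"
        splitOnCaseChange="1"/>
```

Changing the index analyzer only affects documents indexed afterwards, so existing documents need to be reindexed before the new tokens show up in search or highlighting.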