Solr (Open Solr) suggester results contain punctuation

Date: 2014-11-27 11:39:38

Tags: solr search-suggestion

I'm working on a suggester, and the results I get include punctuation. For example, when I type "Volcan", I get:

"volcanoes", "volcanic", "volcano", "volcano," <- comma, "volcanoes." <- period / full stop

Here is the relevant configuration from my solrconfig.xml file:

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">text</str>
    <float name="threshold">0.005</float>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <lst name="invariants">
      <!-- always run the Suggester for queries to this handler -->
      <str name="spellcheck">true</str>
      <!-- collate not needed: the query is tokenized as a keyword, so we only need suggestions for that term -->
      <str name="spellcheck.collate">false</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

In the schema.xml file, I have:

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOffsets="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
                    minShingleSize="2"
                    maxShingleSize="4"
                    outputUnigrams="true"
                    outputUnigramsIfNoShingles="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The result is:

{
    "responseHeader": {
        "status": 0,
        "QTime": 0,
        "params": {
            "wt": "json",
            "q": "volcan"
        }
    },
    "spellcheck": {
        "suggestions": [
            "volcan",
            {
                "numFound": 5,
                "startOffset": 0,
                "endOffset": 6,
                "suggestion": [
                    "volcanoes",
                    "volcanic",
                    "volcano",
                    "volcano,",
                    "volcanoes."
                ]
            }
        ]
    }
}

1 answer:

Answer 0 (score: 0)

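The problem isn't in your requestHandler. Rather, it seems to lie in how the documents that feed the spell field are indexed, and possibly in the spell field definition itself. I think you should use a tokenizer that strips punctuation from those fields.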

Here is the spell field definition from schema.xml that works for me:

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
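As a rough sketch of how this could be wired up: the suggester in the question builds its dictionary from the field named `text` (the `field` setting in solrconfig.xml), so the `spell` field type only affects the suggestions if the suggester's field actually uses it. The fragment below assumes a dedicated field named `spell` that is filled by `copyField` from a source field named `title`. Both field names are hypothetical and do not appear in the original post, so adjust them to your schema.

```xml
<!-- schema.xml: hypothetical field that uses the "spell" field type defined above -->
<field name="spell" type="spell" indexed="true" stored="false" multiValued="true"/>

<!-- "title" is an assumed source field; replace it with the fields you want suggestions from -->
<copyField source="title" dest="spell"/>

<!-- solrconfig.xml: inside <lst name="spellchecker">, point the Suggester at that field -->
<str name="field">spell</str>
```

After changing the analysis chain, the documents need to be reindexed and the suggester dictionary rebuilt (for example by sending a request with `spellcheck.build=true`, or by relying on the `buildOnCommit` setting already present in the question) before the entries with trailing punctuation disappear.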