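I'm working on a suggester, and the results I get include punctuation. For example, when I type "Volcan", I get:
"volcanoes", "volcanic", "volcano", "volcano," <- comma, "volcanoes." <- period / full stop
Here is the code from my solrconfig.xml file: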
<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">text</str>
    <float name="threshold">0.005</float>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <lst name="invariants">
    <!-- always run the Suggester for queries to this handler -->
    <str name="spellcheck">true</str>
    <!-- collate not needed, query if tokenized as keyword, we need only suggestions for that term -->
    <str name="spellcheck.collate">false</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
In my schema.xml file, I have:
<fieldType name="spell" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOffsets="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2"
            maxShingleSize="4"
            outputUnigrams="true"
            outputUnigramsIfNoShingles="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
The result is:
{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "wt": "json",
      "q": "volcan"
    }
  },
  "spellcheck": {
    "suggestions": [
      "volcan",
      {
        "numFound": 5,
        "startOffset": 0,
        "endOffset": 6,
        "suggestion": [
          "volcanoes",
          "volcanic",
          "volcano",
          "volcano,",
          "volcanoes."
        ]
      }
    ]
  }
}
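Answer 0 (score: 0)
The problem isn't in your requestHandler. Rather, it seems to lie in how the documents going into the spell field are indexed, and possibly in the spell field definition itself. I think you should use a tokenizer that strips punctuation from those fields.
Here is the spell field definition in schema.xml that works for me: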
<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
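A field type only takes effect on fields that actually use it. The Suggester in the question is configured with `<str name="field">text</str>`, so it builds its dictionary from whatever the analyzer on the `text` field emits, not from the `spell` type. As a minimal sketch, assuming that is where the punctuation is coming from, a dedicated field could be fed from `text` and handed to the Suggester. The field name `suggest_terms` is hypothetical; the other names are taken from the question.

```xml
<!-- schema.xml: hypothetical field "suggest_terms" that uses the "spell" type -->
<field name="suggest_terms" type="spell" indexed="true" stored="false" multiValued="true"/>

<!-- copy the raw input of "text" into it; copyField copies the value before analysis -->
<copyField source="text" dest="suggest_terms"/>

<!-- solrconfig.xml: inside <lst name="spellchecker">, in place of the existing "field" entry -->
<str name="field">suggest_terms</str>
```

After a change like this the documents have to be reindexed so the new field is populated. The Analysis screen in the Solr admin UI can then be used to check that an input such as `volcano,` comes out of the index analyzer as `volcano`.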