Unexpected results from a Solr index when searching for a string

Time: 2014-05-27 11:46:53

Tags: solr lucene

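I have set up a Solr environment and use the text_nl field type, which I populate from several other fields.

I am seeing some strange behavior. Whenever I search for "new", the query returns documents from the index that contain "new", but it also returns some documents that do not contain the string "new" at all. I have already disabled the filter factories, but to no avail. I keep getting results that do not contain the word.

Below you will find my solrconfig.xml and schema.xml.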

Fieldtype text_nl:

<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
        <filter class="solr.ReversedWildcardFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
        <!-- <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" /> -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

Fields:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="Merk" type="text_nl" indexed="false" stored="true"/>
<field name="Model" type="text_nl" indexed="false" stored="true" multiValued="true" />
<field name="Kleur" type="text_nl" indexed="false" stored="true"/>
<field name="Collectie" type="text_nl" indexed="false" stored="true"/>
<field name="Categorie" type="text_nl" indexed="true" stored="true"/>
<field name="MateriaalSoort" type="text_nl" indexed="false" stored="true"/>
<field name="Zool" type="text_nl" indexed="false" stored="true"/>
<field name="Omschrijving" type="text_nl" indexed="false" stored="true"/>
    <field name="text" type="text_nl" indexed="true" stored="true" multiValued="true"/>

In solrconfig.xml:

<requestHandler name="/query" class="solr.SearchHandler">
 <lst name="defaults">
   <str name="echoParams">explicit</str>
   <int name="rows">50000</int>
   <str name="wt">json</str>
   <str name="indent">true</str>
   <str name="df">text</str>
   <str name="fl">id,Merk,Model,Kleur,Collectie,Categorie,Zool,Omschrijving</str>
   <str name="qf">Merk^100 Model^0.8 Omschrijving^0.3 id^1.0</str>
   <str name="pf">Merk^100 Model^0.8 Omschrijving^0.3 id^1.0</str>
 </lst>
</requestHandler>

The query is as follows: /query?q=new

This yields:

{
        "id":"3215.70.101204",
        "Merk":"New balance",
        "Model":["M576"],
        "Kleur":"Groen",
        "Collectie":"Herenschoenen",
        "Categorie":"Sneakers",
        "Zool":"Rubber",
        "Omschrijving":"Groene nubuck special runner van het merk New Balance. Het logo is van groen nubuck."},
      {
        "id":"3215.26.104592",
        "Merk":"Greve",
        "Model":["6260"],
        "Kleur":"Jeans",
        "Collectie":"Herenschoenen",
        "Categorie":"Sneakers",
        "Zool":"Rubber",
        "Omschrijving":"Deze jeans blauwe suède/lederen runner is van het merk Greve. De runner heeft een merklabel van Greve aan de achterzijde. De runner heeft een witte met houten middenzool en een rubberen zool, verder heeft de runner zilveren studs details."},

As you can see, there is no "new" in the result with the second id.

This is the output of the debug query:

"debug":{
    "rawquerystring":"new",
    "querystring":"new",
    "parsedquery":"text:new",
    "parsedquery_toString":"text:new",
    "explain":{
      "3215.13.101204":"\n1.4514455 = (MATCH) weight(text:new in 2047) [DefaultSimilarity], result of:\n  1.4514455 = fieldWeight in 2047, product of:\n    1.7320508 = tf(freq=3.0), with freq of:\n      3.0 = termFreq=3.0\n    4.469293 = idf(docFreq=113, maxDocs=3661)\n    0.1875 = fieldNorm(doc=2047)\n",
      "3215.30.101204":"\n1.4514455 = (MATCH) weight(text:new in 2142) [DefaultSimilarity], result of:\n  1.4514455 = fieldWeight in 2142, product of:\n    1.7320508 = tf(freq=3.0), with freq of:\n      3.0 = termFreq=3.0\n    4.469293 = idf(docFreq=113, maxDocs=3661)\n    0.1875 = fieldNorm(doc=2142)\n",
      "3215.70.101204":"\n1.4514455 = (MATCH) weight(text:new in 2217) [DefaultSimilarity], result of:\n  1.4514455 = fieldWeight in 2217, product of:\n    1.7320508 = tf(freq=3.0), with freq of:\n      3.0 = termFreq=3.0\n    4.469293 = idf(docFreq=113, maxDocs=3661)\n    0.1875 = fieldNorm(doc=2217)\n",
      "3215.26.104592":"\n1.3966541 = (MATCH) weight(text:new in 2137) [DefaultSimilarity], result of:\n  1.3966541 = fieldWeight in 2137, product of:\n    2.0 = tf(freq=4.0), with freq of:\n      4.0 = termFreq=4.0\n    4.469293 = idf(docFreq=113, maxDocs=3661)\n    0.15625 = fieldNorm(doc=2137)\n",
      "3215.34.104592":"\n1.3966541 = (MATCH) weight(text:new in 2185) [DefaultSimilarity], result of:\n  1.3966541 = fieldWeight in 2185, product of:\n    2.0 = tf(freq=4.0), with freq of:\n      4.0 = termFreq=4.0\n    4.469293 = idf(docFreq=113, maxDocs=3661)\n    0.15625 = fieldNorm(doc=2185)\n",
      "3215.70.104592":"\n1.3966541 = (MATCH) weight(text:new in 2232) [DefaultSimilarity], result of:\n  1.3966541 = fieldWeight in 2232, product of:\n    2.0 = tf(freq=4.0), with freq of:\n      4.0 = termFreq=4.0\n    4.469293 = idf(docFreq=113, maxDocs=3661)\n    0.15625 = fieldNorm(doc=2232)\n",

1 answer:

Answer 0 (score: 0)

This is probably happening because of the combination of EdgeNGramFilter and ReversedWildcardFilter. EdgeNGramFilter first splits each term into n-grams of three or more characters. Each of these is then indexed in both its forward and reversed form, so if you had indexed the word "went", you would end up with:

  • ngrams: "wen", "ent", "went"
  • reversewildcard: "wen", "new", "ent", "tne", "went", "tnew"

So the term "went" produces a match for the query "new". Anything containing either "new" or "wen" can be expected to match.

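Really, I think using both of them is overkill. Reversing n-grams does not make much sense to me. Both are approaches to solving similar problems, and it does not seem to me that they are being used sensibly here.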
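As a rough sketch of what using only one of the two might look like, here is the question's text_nl field type with ReversedWildcardFilterFactory removed from the index analyzer and everything else left as it was. This assumes prefix matching through edge n-grams is the behavior that is actually wanted; if leading-wildcard queries are the real goal, the opposite choice may fit better. The index would need to be rebuilt after a change like this.

```xml
<!-- Sketch only: the question's text_nl type without the reversed-wildcard filter -->
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
        <!-- Index n-grams of 3 to 15 characters so partial terms can match -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
        <!-- ReversedWildcardFilterFactory removed: no reversed tokens are indexed -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
        <!-- No n-gram filter at query time: the query term is compared against the indexed n-grams -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

After reindexing, running the term through the Analysis screen of the Solr admin UI for this field type should show which tokens end up in the index for a given input.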

Also, you might have synonyms defined for the word "new" in "synonyms.txt".
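One way to rule this out, sketched here as an assumption rather than a confirmed fix, is to comment out the synonym filter temporarily and run the query again. The fragment below shows the query analyzer from the question with that one line disabled. Because the index analyzer applies the same filter with expand="true", the same change and a reindex would be needed there for the test to be conclusive.

```xml
<analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <!-- Temporarily disabled to check whether synonym expansion explains the matches -->
    <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

If the unexpected documents disappear with the filter disabled, the entries for "new" in synonyms.txt are the place to look.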