Solr ignores tokens from the third position onward in a phrase query

Time: 2015-01-22 15:10:40

Tags: solr

In Solr (4.10.3) I have the following query (not using dismax or edismax):

t:"past surgical cardiovascular system"

Query debug output:

"rawquerystring": "t:\"past surgical cardiovascular system\"",
"querystring": "t:\"past surgical cardiovascular system\"",
"parsedquery": "MultiPhraseQuery(t:\"(ex former formerly previous prior past) (surgery surg surgical operative)\")",
"parsedquery_toString": "t:\"(ex former formerly previous prior past) (surgery surg surgical operative)\"",

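It seems Solr is completely ignoring the tokens from the third position onward. I only noticed this after 8 hours of investigation. What am I missing? How can I force Solr to consider the third and fourth tokens?

In case it helps, the field t has the following type: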

    <fieldType name="text_en_splitting" class="solr.TextField"
        positionIncrementGap="100" autoGeneratePhraseQueries="false">
        <analyzer type="index">
            <!-- <tokenizer class="solr.WhitespaceTokenizerFactory" /> -->
            <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*[\{\}\[\]\|\(\):;,]\s*|\b[-/+]\b|\s+[&amp;+-]\s+|(?:\b')?\s+|\.(?=\z|\s)" />
            <!-- in this example, we will only use synonyms at query time <filter
                class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true"
                expand="false"/> -->
            <!-- Case insensitive stop word removal. add enablePositionIncrements=true
                in both the index and query analyzers to leave a 'gap' for more accurate
                phrase queries. -->
            <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10"/>
            <filter class="solr.ClassicFilterFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <!-- <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" /> -->
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.EnglishPossessiveFilterFactory" />
            <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            <filter class="solr.EnglishMinimalStemFilterFactory" />
        </analyzer>
        <analyzer type="query">
            <!-- <tokenizer class="solr.WhitespaceTokenizerFactory" /> -->
            <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*[\{\}\[\]\|\(\):;,]\s*|\b[-/+]\b|\s+[&amp;+-]\s+|(?:\b')?\s+|\.(?=\z|\s)" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
            <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10"/>
            <filter class="solr.ClassicFilterFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <!-- <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" /> -->
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.EnglishPossessiveFilterFactory" />
            <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            <filter class="solr.EnglishMinimalStemFilterFactory" />
            <!-- <filter class="solr.PorterStemFilterFactory" /> -->
        </analyzer>
    </fieldType>

I think there is a bug in Solr.

I ran a different query, and there I got all the tokens in the parsed query:

"rawquerystring": "t:\"acute myocardial infarction surgical\"",
"querystring": "t:\"acute myocardial infarction surgical\"",
"parsedquery": "MultiPhraseQuery(t:\"(acute aqt) (myocardial myocrd) (infarct infarction nfrct) (surgery surg surgical)\")",
"parsedquery_toString": "t:\"(acute aqt) (myocardial myocrd) (infarct infarction nfrct) (surgery surg surgical)\"",

If I prepend 'past' to the query, tokens are dropped:

"rawquerystring": "t:\"past acute myocardial infarction surgical\"",
"querystring": "t:\"past acute myocardial infarction surgical\"",
"parsedquery": "MultiPhraseQuery(t:\"(ex former formerly previous prior past) (acute aqt) (myocardial myocrd)\")",
"parsedquery_toString": "t:\"(ex former formerly previous prior past) (acute aqt) (myocardial myocrd)\"",

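The analysis page does not give me much detail, because it analyzes the tokens independently.

2 answers:

Answer 0 (score: 0)

You have an extremely complex query analyzer chain. Fortunately, you can use the Analysis screen in the Web Admin UI to see exactly what happens inside it.

Put your phrase there (on the right-hand side, for query processing) and follow step by step what happens to each word.

That should show you, for example, whether some terms are being swallowed unexpectedly by one of the filters.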
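The same analysis can usually be requested over HTTP, which makes it easier to diff the output of each filter stage. The sketch below only builds the request URL. It assumes that Solr's field analysis request handler is registered at `/analysis/field` in solrconfig.xml and that it accepts the `analysis.fieldname` and `analysis.query` parameters; the host, port, and the core name `collection1` are placeholders, not values taken from the question.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, core, field_name, query_text):
    """Build a URL for Solr's field analysis handler.

    Assumption: the handler is registered at /analysis/field for this core.
    """
    params = {
        "analysis.fieldname": field_name,  # field whose query analyzer is applied
        "analysis.query": query_text,      # text to run through the query analyzer
        "wt": "json",
    }
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"


# Hypothetical host and core name, used only for illustration.
url = field_analysis_url(
    "http://localhost:8983/solr",
    "collection1",
    "t",
    "past surgical cardiovascular system",
)
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return the token stream after each tokenizer and filter, so a stage where the token count suddenly stops growing stands out.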

Answer 1 (score: 0)

I finally found the problem: I was using solr.LimitTokenCountFilterFactory to limit the query to 10 tokens after synonym expansion. The solution is to remove this filter.
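The debug output in the question is consistent with this: 'past' expands to 6 tokens and 'surgical' to 4, so the limit of 10 is reached before 'cardiovascular' is emitted. The following is a simplified model of that behaviour, not Solr's actual implementation. The synonym groups are copied from the parsed queries above; the assumptions are that 'cardiovascular' and 'system' have no synonyms and that stemming, lowercasing, and stop words can be ignored.

```python
from itertools import islice

# Synonym groups as they appear in the question's debug output.
SYNONYMS = {
    "past": ["ex", "former", "formerly", "previous", "prior", "past"],
    "surgical": ["surgery", "surg", "surgical", "operative"],
    "acute": ["acute", "aqt"],
    "myocardial": ["myocardial", "myocrd"],
    "infarction": ["infarct", "infarction", "nfrct"],
}


def expand(phrase):
    """Yield (position, token) pairs; synonyms of a word share its position."""
    for position, word in enumerate(phrase.split()):
        for token in SYNONYMS.get(word, [word]):
            yield position, token


def parsed_phrase(phrase, max_token_count=10):
    """Model a token-count limit applied after synonym expansion.

    Every synonym counts as one token, so the limit is reached quickly.
    Pass max_token_count=None to model the chain without the limit.
    """
    positions = {}
    for position, token in islice(expand(phrase), max_token_count):
        positions.setdefault(position, []).append(token)
    groups = []
    for _, tokens in sorted(positions.items()):
        groups.append(tokens[0] if len(tokens) == 1 else "(" + " ".join(tokens) + ")")
    return " ".join(groups)


print(parsed_phrase("past surgical cardiovascular system"))
print(parsed_phrase("past surgical cardiovascular system", max_token_count=None))
```

With the limit in place, the first call reproduces the truncated phrase from the question. Without it, all four positions survive. The model also explains why 'operative' is missing from the second query in the question: it would have been the eleventh token.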