Question

My Solr schema has a field named title:

<field name="title" type="text_en" termVectors="true" indexed="true" required="true" stored="true" />

text_en is defined as follows:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100" docValues="false" multiValued="false">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true" />
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_en.txt" ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.PorterStemFilterFactory" />
    </analyzer>
</fieldType>

I am seeing strange behavior when a multi-word synonym contains a stopword.

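If the stopword appears in the middle of the synonym, everything works as expected. For example, suppose my synonyms file contains the following (where i is a stopword):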

iphone, apple i phone

If I query: /select?q=iphone&qf=title&defType=edismax

the parsed query is: +DisjunctionMaxQuery(((((+title:appl +title:phone) title:iphon))))

I get the same parsed query for: /select?q=apple i phone&qf=title&defType=edismax
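One way to obtain parsed queries like the ones above is to add Solr's debugQuery=true parameter to the request and read the parsed query from the debug section of the response. The sketch below only builds such a URL; the host, port, and core name (mycore) are assumptions for illustration and are not taken from my setup.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with a core named "mycore".
BASE_URL = "http://localhost:8983/solr/mycore/select"


def build_debug_url(query: str, base: str = BASE_URL) -> str:
    """Build an edismax request against the title field with debug output enabled."""
    params = {
        "q": query,
        "qf": "title",
        "defType": "edismax",
        "debugQuery": "true",  # asks Solr to include the parsed query in the response
    }
    # urlencode escapes the spaces in multi-word queries such as "apple i phone".
    return f"{base}?{urlencode(params)}"


print(build_debug_url("apple i phone"))
```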

However, if the stopword appears at the beginning or the end of the synonym, the behavior is unpredictable.

In most cases the whole synonym is dropped. For example, if I change the synonyms file to:

iphone, i phone

and run the same query again (q=iphone), I get:

+DisjunctionMaxQuery(((title:iphon)))

I expected the dismax query to contain both iphon and phone (since i would be removed as a stopword).
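To make that expectation concrete, here is a minimal sketch of the naive model I have in mind: expand the synonym group first, then drop stopwords from each alternative. It is not how Solr or Lucene process the token graph internally, it omits stemming (so it yields iphone rather than iphon), and the helper names are hypothetical.

```python
# Assumption: only the stopword relevant to this example is listed.
STOPWORDS = {"i"}

# One synonym group per line of synonyms_en.txt, as in the example above.
SYNONYM_GROUPS = [["iphone", "i phone"]]


def expand_synonyms(query: str) -> list[str]:
    """Return every phrase in the query's synonym group (mimics expand="true")."""
    q = query.lower()
    for group in SYNONYM_GROUPS:
        if q in group:
            return list(group)
    return [q]


def remove_stopwords(phrase: str) -> list[str]:
    """Split a phrase on whitespace and drop stopwords."""
    return [tok for tok in phrase.split() if tok not in STOPWORDS]


def expected_alternatives(query: str) -> list[list[str]]:
    """Alternatives I would expect in the dismax query, before stemming."""
    alternatives = []
    for phrase in expand_synonyms(query):
        tokens = remove_stopwords(phrase)
        if tokens:  # skip alternatives that consist only of stopwords
            alternatives.append(tokens)
    return alternatives


print(expected_alternatives("iphone"))
```

Under this model the query iphone yields two alternatives, iphone and phone, whereas the actual parsed query keeps only title:iphon.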

In some cases the behavior is even stranger.

For example, if my synonyms file is:

between two ferns,netflix comedy,zach galifianakis show,netflix 2019 best

and ferns and best are stopwords, then running this query:

/select?q=netflix comedy&qf=title&defType=edismax

I get:

+DisjunctionMaxQuery((((+title:between +title:two +title:galifianaki +title:show) (+title:netflix +title:2019 +title:comedi))))

which is a very strange combination.

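I cannot make sense of this behavior, and I have not found anything about it in the documentation or elsewhere online. Maybe I am missing something. Any help or pointers would be greatly appreciated.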

Solr version: 8.4.1

Solr: strange behavior when using synonyms and stopwords together
