In Solr, why is "built" not stemmed to "build", yet "building" is?

Date: 2011-08-18 01:10:48

Tags: lucene solr stemming porter-stemmer

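I am trying to figure out two things in this post:

  1. Why is "built" not being stemmed to "build", even though the field type definition has a stemmer defined? Yet "building" is being stemmed to "build".

  2. How can I use Luke to examine the index and see which words were stemmed to what? I was unable to see "building" being stemmed to "build" in Luke. I know Lucene is stemming it, because I was able to successfully retrieve rows containing "building" by searching for "build".

The link was very helpful, but did not answer my questions.

For reference, here is the relevant part of schema.xml: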

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords_en.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords_en.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
    

and the field definition is:

    <field name="features" type="text_en" indexed="true" stored="true" multiValued="true"/>
    

The data set consists of several documents: one document has "building" in the features field, one has "built" in the same field, and one has "Built-in" in the features field.

File hd.xml:

    <field name="features">building NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor</field>
    

File ipod_video.xml:

    <field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field>
    

File sd500.xml:

     <field name="features">built in flash, red-eye reduction</field>
    

Using Lukeall-3.3.0, I searched for 'features:build'. I get back 1 document instead of the expected 3. Even in that one document I don't see the stemmed term; I only see the original word "building".

Searching for 'features:built' in Luke returns two documents.

Selecting one of them shows the original "built", but not "build".

1 answer:

Answer 0 (score: 2)

For special cases like this, you can use the StemmerOverrideFilter to tweak the stemming algorithm.
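As a rough sketch of what that could look like with the text_en field type from the question: the Porter stemmer strips suffixes by rule ("-ing", "-ed", "-s" and so on) and has no list of irregular forms, so "building" loses its "-ing" while "built" passes through unchanged. An override dictionary can map such forms by hand. The file name stemdict.txt and its single entry are assumptions made for illustration; they are not part of the original schema.

```xml
<!-- stemdict.txt (hypothetical file in the same conf directory as schema.xml).
     One mapping per line, with a TAB between the word and the stem to emit:

     built	build
-->

<!-- Tail of the filter chain in BOTH the index and the query analyzer of text_en.
     The override filter must come before the stemmer: tokens it rewrites are
     flagged as keywords, so PorterStemFilterFactory leaves them alone. -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
```

Documents indexed before the change keep their old terms, so the data would need to be reindexed before 'features:build' can match the "built" and "Built-in" documents.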