Solr: ignore string casing when calculating facet counts

Time: 2017-08-27 17:09:59

Tags: string solr case-insensitive faceted-search

The title field in my database contains the following values:

"I Am A String"
"I am A string"

I want to offer the title field as a facet in my search results.

Current result:

<lst name="title">
    <int name="I Am A String">4</int>
    <int name="I am A string">3</int>
</lst>

Desired result:

<lst name="title">
    <int name="I Am A String">7</int>
</lst>

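I don't actually care which of the two available string variants is chosen for the final result, as long as strings that are identical apart from case are counted under the same facet.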

I tried the following field type definitions for the title field. For each one I've noted the resulting facet behavior.

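string = treats differently cased values as different strings
string_exact = treats differently cased values as different strings
text_ws = breaks the value up into words, casing intact
text = breaks the value up into separate words
textTight = breaks the value up into separate words
textTrue = breaks the value up into words, casing intact
string_exacttest = breaks the value up into words, casing intact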

This is my schema.xml:

<field name="title" type="string" indexed="true" stored="true"/>


<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />

<fieldType name="string_exact" class="solr.TextField"
    sortMissingLast="true" omitNorms="true">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>           
    </analyzer>
</fieldType>    

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
    Synonyms and stopwords are customized by external files, and stemming is enabled. Duplicate tokens at the same position (which may result from Stemmed Synonyms or WordDelim parts) are removed.-->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


<!-- Less flexible matching, but fewer false matches. Probably not ideal for product names, but may be good for SKUs. Can insert dashes in the wrong place and still match. -->
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
    <!--
      this filter can remove any duplicate tokens that appear at the same position - sometimes possible with WordDelimiterFilter in conjunction with
      stemming.
    -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


<fieldType name="textTrue" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
  </analyzer>
</fieldType>    

How can I make sure that identical strings (ignoring case) are grouped together when facet counts are calculated?

1 answer:

Answer 0: (score: 1)

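Your string_exact definition is almost what you need, but you also have to apply a LowerCaseFilter so that every value is lowercased. The KeywordTokenizer keeps the whole value as a single token, so the value is not split into separate terms on whitespace. A string field (StrField) does not allow any further processing, whereas a TextField with a KeywordTokenizer behaves the same way but lets you add filters that process the token afterwards.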

<fieldType name="string_facet" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
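
To put the new type to use, the title field has to reference it. One possible way to wire it up, as a rough sketch: keep the original title field for display and copy its value into a separate facet field. The field name title_facet is an assumption made for illustration and does not appear in the question's schema.

```xml
<!-- Original field, kept as-is for display and exact retrieval -->
<field name="title" type="string" indexed="true" stored="true"/>

<!-- Hypothetical facet-only field using the lowercasing type defined above -->
<field name="title_facet" type="string_facet" indexed="true" stored="false"/>

<!-- Copy the raw title into the facet field at index time -->
<copyField source="title" dest="title_facet"/>
```

With this setup you would facet on the new field (facet=true&facet.field=title_facet). Facet values come from the indexed tokens, so both variants should be counted under the lowercased label "i am a string" rather than under either original spelling. Existing documents need to be reindexed before the change shows up in the counts.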