Using Solr

Date: 2017-03-08 11:23:33

Tags: django solr autocomplete search-suggestion

I am using Solr 6.4 with Haystack 2.6.1 and pySolr 3.6.

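I am looking for a Google-like autocomplete suggestion feature. Using EdgeNGram actually works quite well, but it only returns whole document titles, which is not what I want:

Example: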

typing: 'new y'
return:

New york, fabulous city that never sleep
A trip to new york by night
...

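This means the user can only pick a specific document from the suggestion list, and the search then returns only documents matching the suggested title.

What I want is suggestions like the following: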

typing: 'new y'
return:

new york
new york by night
new york city
trip to new york

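There is an article that suggests indexing the user queries that returned results, and then using those queries as suggestions: https://lucidworks.com/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

That means parsing the Solr logs, or running a Data Import Handler (DIH) over a pile of saved user queries in a database.

That article is actually outdated (2009), and since then Solr has given us the Suggester (https://cwiki.apache.org/confluence/display/solr/Suggester).

In any case, I would like to know whether there is a good tutorial on using the Suggester with recurring query phrases, rather than returning my document titles, and without having to save users' queries in a database, import them through a scheduled process, reindex, and so on.

My search_indexes.py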

class ArticleIndex(indexes.SearchIndex, indexes.Indexable): 

    text = indexes.CharField(document=True, use_template=True)
    created = indexes.DateTimeField(model_attr='created')
    rating = indexes.IntegerField(model_attr='rating')
    title = indexes.CharField(model_attr='title', boost=1.125)
    term = indexes.EdgeNgramField(model_attr='title')


    def get_model(self):
        return Article

My article_text.txt

{{ object.title }}
{{ object.created }}
{{ object.rating }}

My schema.xml

<field name="term" type="text_general" indexed="true" stored="true" />
<field name="weight" type="float" indexed="true" stored="true" />

<fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front" />
  </analyzer>
</fieldType>

<fieldType name="suggestType" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^a-zA-Z0-9]" replacement=" " />
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
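
To make it clearer what the edge_ngram type above puts into the index, here is a rough Python sketch of the front edge n-gram step with minGramSize=2 and maxGramSize=15. It is only an illustration, not Solr's code: the function names are made up, and the WordDelimiterFilterFactory step is left out.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the front edge n-grams of a single token."""
    upper = min(len(token), max_gram)
    return [token[:size] for size in range(min_gram, upper + 1)]


def analyze(text):
    """Split on whitespace, lowercase, then expand each token into edge n-grams."""
    grams = []
    for token in text.lower().split():
        grams.extend(edge_ngrams(token))
    return grams


print(analyze("New York"))
```

Every prefix of every word becomes a searchable term, which is why typing 'new y' matches the documents, but what comes back is still the whole stored title.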

My solrconfig.xml

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
    <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.dictionary">infixSuggester</str>
        <str name="suggest.onlyMorePopular">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.collate">true</str>
    </lst>
    <arr name="components">
        <str>suggest</str>
    </arr>
</requestHandler>
<searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
        <str name="name">infixSuggester</str>
        <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
        <str name="indexPath">infix_suggestions</str>
        <str name="highlight">false</str>
        <str name="dictionaryImpl">DocumentDictionaryFactory</str>
        <str name="field">term</str>
        <str name="weightField">weight</str>
        <str name="suggestAnalyzerFieldType">suggestType</str>
        <str name="buildOnStartup">false</str>
        <str name="buildOnCommit">false</str>
    </lst>
</searchComponent> 

I query Solr with pysolr, since Haystack does not implement a suggest method yet:

from django.conf import settings
from pysolr import Solr

# query_string holds the text the user has typed so far
solr = Solr(settings.HAYSTACK_CONNECTIONS['default']['URL'], search_handler='/suggest', use_qt_param=False)
raw_results = solr.search('', **{'suggest.q': query_string})
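
For completeness, this is roughly how the suggested terms could be pulled out of the decoded JSON response. It is a minimal sketch: extract_suggestions is a helper name I made up, and the sample dictionary is hand-written to mimic the shape of a SuggestComponent response rather than copied from a real server.

```python
def extract_suggestions(raw_response, dictionary, query):
    """Return the suggested terms for `query` from a decoded /suggest response."""
    by_query = raw_response.get("suggest", {}).get(dictionary, {})
    entry = by_query.get(query, {})
    return [item["term"] for item in entry.get("suggestions", [])]


# Hand-written sample shaped like a SuggestComponent JSON response.
sample = {
    "suggest": {
        "infixSuggester": {
            "new y": {
                "numFound": 2,
                "suggestions": [
                    {"term": "New york, fabulous city that never sleep",
                     "weight": 0, "payload": ""},
                    {"term": "A trip to new york by night",
                     "weight": 0, "payload": ""},
                ],
            }
        }
    }
}

print(extract_suggestions(sample, "infixSuggester", "new y"))
```

With pysolr the decoded response should be available as raw_results.raw_response, but that is worth checking against the pysolr version in use.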

2 answers:

Answer 0: (score: 1)

After a hard struggle, I finally got something working. Not perfect, but good enough.

Following this article: http://alexbenedetti.blogspot.fr/2015/07/solr-you-complete-me.html

I used FreeTextLookupFactory.

My search_indexes.py

class ArticleIndex(indexes.SearchIndex, indexes.Indexable): 

    text = indexes.CharField(document=True, use_template=True)
    created = indexes.DateTimeField(model_attr='created')
    rating = indexes.IntegerField(model_attr='rating')
    title = indexes.CharField(model_attr='title', boost=1.125)

    def get_model(self):
        return Article

My schema.xml

<field name="django_ct" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="django_id" type="string" indexed="true" stored="true" multiValued="false"/>


<field name="text" type="text_en" indexed="true" stored="true" multiValued="false"  termVectors="true" />
<field name="rating" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="title" type="text_en" indexed="true" stored="true" multiValued="false"/>
<field name="created" type="date" indexed="true" stored="true" multiValued="false"/>

My solrconfig.xml

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">suggest</str>
    <str name="lookupImpl">FreeTextLookupFactory</str> 
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="ngrams">3</str>
    <float name="threshold">0.004</float>
    <str name="highlight">false</str>
    <str name="buildOnCommit">false</str>
    <str name="separator"> </str>
    <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
  <lst name="defaults">
    <str name="suggest.dictionary">suggest</str>
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

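Since I am using Solr 6.4, it runs in managed-schema mode by default (my changes to schema.xml were being ignored), so I had to switch to the manually edited schema by adding this to solrconfig.xml: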

<schemaFactory class="ClassicIndexSchemaFactory"/>

See here: https://cwiki.apache.org/confluence/display/solr/Schema+Factory+Definition+in+SolrConfig#SchemaFactoryDefinitioninSolrConfig-Switchingfromschema.xmltoManagedSchema

Then restart Solr and rebuild the index through Haystack with rebuild_index.

And of course build the suggester with curl:

curl http://127.0.0.1:8983/solr/collection1/suggest?suggest.build=true

Final result:

curl http://127.0.0.1:8983/solr/collection1/suggest?suggest.q=new%20y

I will try to learn more about FreeTextLookupFactory to see whether I can make it more accurate, but it is already satisfactory. Hope this helps.

PS: Always keep an eye on the logs. I strongly recommend keeping them open in a tab at all times. It saved me hours of pain...
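
If you would rather build the two request URLs used in this answer (the suggester build and the actual lookup) from Python than type them by hand, something like the following should do. It is only a sketch: suggest_url is a name I made up, and the base URL is simply the one from the curl calls.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the curl calls in this answer; adjust host and core name.
BASE_URL = "http://127.0.0.1:8983/solr/collection1/suggest"


def suggest_url(query=None, build=False, base_url=BASE_URL):
    """Build a /suggest URL, optionally asking Solr to (re)build the suggester."""
    params = {}
    if build:
        params["suggest.build"] = "true"
    if query is not None:
        params["suggest.q"] = query
    # quote_via=quote encodes spaces as %20 instead of '+'
    return base_url + "?" + urlencode(params, quote_via=quote)


print(suggest_url(build=True))
print(suggest_url("new y"))
```

Since buildOnCommit is false in the config above, the build request has to be sent again after each reindex for new titles to show up in the suggestions.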

Answer 1: (score: 0)

For your needs, I would suggest using BlendedInfixLookupFactory, set up as follows:

In schema.xml, create a field to be used by the suggester, and copy the title into it:

<field name="title" type="text_general" indexed="true" stored="true" /> 
<field name="term_suggest" type="phrase_suggest" indexed="true" stored="true" multiValued="true"/>

<copyField source="title" dest="term_suggest"/>

<fieldType name="phrase_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Then in the solrconfig.xml file:

<searchComponent name="suggest" class="solr.SuggestComponent">
   <lst name="suggester">
      <str name="name">suggest</str>
      <str name="lookupImpl">BlendedInfixLookupFactory</str>
      <str name="blenderType">linear</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">term_suggest</str>
      <str name="weightField">weight</str>
      <str name="suggestAnalyzerFieldType">text_suggest</str>
      <str name="queryAnalyzerFieldType">phrase_suggest</str>
      <str name="indexPath">suggest</str>
      <str name="buildOnStartup">false</str>
      <str name="buildOnCommit">false</str>
      <bool name="exactMatchFirst">true</bool>
   </lst> 
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">false</str>
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

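With BlendedInfixLookupFactory you can match "new y" anywhere in the field, with more weight given to matches that occur near the beginning. Combining the standard tokenizer for suggestAnalyzerFieldType with the keyword tokenizer for queryAnalyzerFieldType lets you search with spaces (the query "new y" is read as a single string, or keyword).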
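
To give a feel for what the linear blender does, here is a small Python sketch. As far as I understand the reference guide, the linear blender scales the weight by (1 - 0.10 * position), where position is that of the first matching word. The matching logic below is a deliberate simplification of mine (contiguous words, last one treated as a prefix), and the function names and the weight of 10 are made up for the example.

```python
def first_match_position(title, query):
    """Position of the first word where `query` matches, or None.

    Simplification: the query words must appear contiguously, and the
    last query word only needs to be a prefix of the title word.
    """
    tokens = title.lower().split()
    wanted = query.lower().split()
    if not wanted:
        return None
    for start in range(len(tokens) - len(wanted) + 1):
        window = tokens[start:start + len(wanted)]
        if window[:-1] == wanted[:-1] and window[-1].startswith(wanted[-1]):
            return start
    return None


def linear_blend(weight, position):
    """Linear position weighting: earlier matches keep more of the weight."""
    return weight * (1 - 0.10 * position)


titles = [
    "A trip to new york by night",
    "New york, fabulous city that never sleep",
]
for title in titles:
    position = first_match_position(title, "new y")
    print(title, position, linear_blend(10, position))
```

Both titles match, but the one that starts with "New york" keeps its full weight while the other loses 10% per word of offset, so it ranks lower.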

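The Confluence wiki link you posted is fine; it was last modified in September 2016.

EDIT: I didn't realize you didn't want the whole title. You can try using shingles, by changing the phrase_suggest fieldType in the schema above to: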

<fieldType name="phrase_suggest" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" 
            minShingleSize="2"
            maxShingleSize="4"
            outputUnigrams="true"
            outputUnigramsIfNoShingles="true"/>
    </analyzer>
</fieldType>
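
In case shingles are unfamiliar: they are word n-grams. The sketch below shows roughly which terms the settings above (minShingleSize=2, maxShingleSize=4, outputUnigrams=true) would produce for a title. It is an illustration in Python rather than Solr's filter, it splits on whitespace instead of using the standard tokenizer, and the function name is made up.

```python
def shingles(text, min_size=2, max_size=4, output_unigrams=True):
    """Return word n-grams (shingles) of `text`, optionally with single words."""
    tokens = text.lower().split()
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break  # not enough words left for a shingle of this size
            out.append(" ".join(tokens[start:start + size]))
    return out


for term in shingles("A trip to new york by night"):
    print(term)
```

Among the generated terms are "new york", "new york by" and "new york by night", which is the kind of phrase you were asking for.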

EDIT2: Alternatively, you can define phrase_suggest with the standard tokenizer and a shingle filter for the index analyzer, and the keyword tokenizer for the query analyzer:

<fieldType name="phrase_suggest" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.ShingleFilterFactory"
            minShingleSize="2"
            maxShingleSize="4"
            outputUnigrams="true"
            outputUnigramsIfNoShingles="true"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

Then in the suggest searchComponent you only need:

<str name="suggestAnalyzerFieldType">phrase_suggest</str>

(and no queryAnalyzerFieldType). Of course, you will need to adjust the ShingleFilterFactory settings to suit your needs.
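
The idea behind the index/query split can be pictured with a toy model: the index side holds shingle terms, and the query side keeps the whole input as one lowercased token that is matched as a prefix. The Python below is only that toy model, with a hand-written term list and a made-up function name. Whether Solr hands back the matching shingle or the whole stored field value depends on the lookup implementation, so check the actual response.

```python
def suggest(indexed_terms, user_input, limit=10):
    """Prefix-match the whole input, as one lowercased keyword, against terms."""
    query = user_input.lower()  # keyword tokenizer + lowercase: one single token
    matches = {term for term in indexed_terms if term.startswith(query)}
    return sorted(matches)[:limit]


# Hand-written stand-in for shingle terms produced at index time.
terms = [
    "new", "new york", "new york by", "new york by night",
    "new york city", "trip to new york", "york",
]

print(suggest(terms, "New Y"))
```

Note that "trip to new york" is not returned here, because a pure prefix match only looks at the start of each term; an infix lookup is what would pick that one up.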