Django haystack EdgeNgramField gives different results than Elasticsearch

Time: 2013-12-06 17:55:37

Tags: python django elasticsearch django-haystack

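I'm currently running haystack with an Elasticsearch backend, and I'm building an autocomplete feature for city names. The problem is that SearchQuerySet gives me different results, which look wrong to me, from the same query executed directly in Elasticsearch, which returns what I expect.

I am using: Django 1.5.4, django-haystack 2.1.0, pyelasticsearch 0.6.1, elasticsearch 0.90.3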

With the following sample data:

  • Midfield
  • Midland City
  • Midway
  • Minor
  • Minturn
  • Miami Beach

Using

SearchQuerySet().models(Geoname).filter(name_auto='mid')
or
SearchQuerySet().models(Geoname).autocomplete(name_auto='mid')

the result always contains all 6 names, including Min* and Mia*. However, querying Elasticsearch directly returns the correct data:

"query": {
    "filtered" : {
        "query" : {
            "match_all": {}
        },
        "filter" : {
             "term": {"name_auto": "mid"}
        }
    }
}

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1,
      "hits": [
         {
            "_index": "haystack",
            "_type": "modelresult",
            "_id": "csi.geoname.4075977",
            "_score": 1,
            "_source": {
               "name_auto": "Midfield",
            }
         },
         {
            "_index": "haystack",
            "_type": "modelresult",
            "_id": "csi.geoname.4075984",
            "_score": 1,
            "_source": {
               "name_auto": "Midland City",
            }
         },
         {
            "_index": "haystack",
            "_type": "modelresult",
            "_id": "csi.geoname.4075989",
            "_score": 1,
            "_source": {
               "name_auto": "Midway",
            }
         }
      ]
   }
}

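The behaviour is the same with other examples. My guess is that haystack splits and analyzes the search string into all the possible "min_gram" character groups, and that is why it returns the wrong results.

I'm not sure whether I'm doing something wrong or misunderstanding something, or whether this is how haystack is supposed to work, but I need the haystack results to match the Elasticsearch results.

So, how can I fix the problem or make it work?

My objects, summarized, look like this:

Model: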

class Geoname(models.Model):
    id = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=255)

Index:

class GeonameIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name_auto = indexes.EdgeNgramField(model_attr='name')

    def get_model(self):
        return Geoname

Mapping:

modelresult: {
    _boost: {
        name: "boost",
        null_value: 1
    },
    properties: {
        django_ct: {
            type: "string"
        },
        django_id: {
            type: "string"
        },
        name_auto: {
            type: "string",
            store: true,
            term_vector: "with_positions_offsets",
            analyzer: "edgengram_analyzer"
        }
    }
}

Thanks.

2 answers:

Answer 0: (score: 11)

After digging into the code, I found that the search generated by haystack is:

{
  "query":{
     "filtered":{
        "filter":{
           "fquery":{
              "query":{
                 "query_string":{
                    "query": "django_ct:(csi.geoname)"
                 }
              },
              "_cache":false
           }
        },
        "query":{
           "query_string":{
              "query": "name_auto:(mid)",
              "default_operator":"or",
              "default_field":"text",
              "auto_generate_phrase_queries":true,
              "analyze_wildcard":true
           }
        }
     }
  },
  "from":0,
  "size":6
}

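Running this query in Elasticsearch returns the same 6 objects that haystack shows. But if I add the following to the "query_string":

"analyzer": "standard"

it works as expected. So the idea is to be able to set a different search analyzer for the field.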
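To repeat that experiment outside haystack, the generated query can be rebuilt as a plain dictionary with the extra key added. This is only a sketch: `build_query` and its parameters are hypothetical names made up for illustration, and the body simply mirrors the JSON shown above.

```python
import json


def build_query(term, field="name_auto", analyzer=None, size=6):
    """Rebuild the query body shown above as a dict.

    `build_query` is a hypothetical helper, not part of haystack.
    """
    query_string = {
        "query": "%s:(%s)" % (field, term),
        "default_operator": "or",
        "default_field": "text",
        "auto_generate_phrase_queries": True,
        "analyze_wildcard": True,
    }
    if analyzer is not None:
        # Override the analyzer applied to the query text at search time.
        query_string["analyzer"] = analyzer

    return {
        "query": {
            "filtered": {
                "filter": {
                    "fquery": {
                        "query": {
                            "query_string": {"query": "django_ct:(csi.geoname)"}
                        },
                        "_cache": False,
                    }
                },
                "query": {"query_string": query_string},
            }
        },
        "from": 0,
        "size": size,
    }


body = build_query("mid", analyzer="standard")
print(json.dumps(body, indent=2))
```

The printed JSON can then be sent to the index's `_search` endpoint with any HTTP client to compare the results with and without the `analyzer` key.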

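Following the link in @user954994's answer and the instructions in this post, what I ended up doing was:

  1. I created a custom elasticsearch backend, adding a new custom analyzer based on the standard analyzer.
  2. I added a custom EdgeNgramField that allows setting one analyzer for indexing (index_analyzer) and another for searching (search_analyzer).
  3. So, my new settings are: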

    ELASTICSEARCH_INDEX_SETTINGS = {
        'settings': {
            "analysis": {
                "analyzer": {
                    "ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "lowercase",
                        "filter": ["haystack_ngram"]
                    },
                    "edgengram_analyzer": {
                        "type": "custom",
                        "tokenizer": "lowercase",
                        "filter": ["haystack_edgengram"]
                    },
                    "suggest_analyzer": {
                        "type":"custom",
                        "tokenizer":"standard",
                        "filter":[
                            "standard",
                            "lowercase",
                            "asciifolding"
                        ]
                    },
                },
                "tokenizer": {
                    "haystack_ngram_tokenizer": {
                        "type": "nGram",
                        "min_gram": 3,
                        "max_gram": 15,
                    },
                    "haystack_edgengram_tokenizer": {
                        "type": "edgeNGram",
                        "min_gram": 2,
                        "max_gram": 15,
                        "side": "front"
                    }
                },
                "filter": {
                    "haystack_ngram": {
                        "type": "nGram",
                        "min_gram": 3,
                        "max_gram": 15
                    },
                    "haystack_edgengram": {
                        "type": "edgeNGram",
                        "min_gram": 2,
                        "max_gram": 15
                    }
                }
            }
        }
    }
    

    My new custom build_schema method looks like this:

    def build_schema(self, fields):
        content_field_name, mapping = super(CustomElasticBackend,
                                              self).build_schema(fields)
    
        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]
    
            index_analyzer = getattr(field_class, 'index_analyzer', None)
            search_analyzer = getattr(field_class, 'search_analyzer', None)
            field_analyzer = getattr(field_class, 'analyzer', self.DEFAULT_ANALYZER)
    
            if field_mapping['type'] == 'string' and field_class.indexed:
                if not hasattr(field_class, 'facet_for') and field_class.field_type not in ('ngram', 'edge_ngram'):
                    field_mapping['analyzer'] = field_analyzer
    
            if index_analyzer and search_analyzer:
                field_mapping['index_analyzer'] = index_analyzer
                field_mapping['search_analyzer'] = search_analyzer
                # pop() avoids a KeyError when no 'analyzer' key was set above
                field_mapping.pop('analyzer', None)
    
            mapping.update({field_class.index_fieldname: field_mapping})
        return (content_field_name, mapping)
    

    After rebuilding the index, my mapping looks like this:

    modelresult: {
       _boost: {
           name: "boost",
           null_value: 1
       },
       properties: {
           django_ct: {
               type: "string"
           },
           django_id: {
               type: "string"
           },
           name_auto: {
               type: "string",
               store: true,
               term_vector: "with_positions_offsets",
               index_analyzer: "edgengram_analyzer",
               search_analyzer: "suggest_analyzer"
           }
       }
    }
    

    Now everything works as expected!
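Why separate analyzers change the result can be sketched in plain Python. This is a rough approximation, not Elasticsearch itself: the function names are hypothetical, the index side imitates edgengram_analyzer (lowercase tokenizer plus edge n-grams with min_gram 2 and max_gram 15), the search side imitates suggest_analyzer as lowercase whole words, and asciifolding is left out.

```python
import re


def edge_ngrams(token, min_gram=2, max_gram=15):
    """Front edge n-grams of one token, e.g. 'mid' -> ['mi', 'mid']."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_tokens(text):
    """Stand-in for edgengram_analyzer: lowercase letter runs, then edge n-grams."""
    words = re.findall(r"[a-z]+", text.lower())
    return {gram for word in words for gram in edge_ngrams(word)}


def search_tokens_edgengram(text):
    """Query analyzed with the same edge n-gram analyzer (the original behaviour)."""
    return index_tokens(text)


def search_tokens_plain(text):
    """Stand-in for suggest_analyzer: lowercase whole words only."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(names, query_tokens):
    """Names sharing at least one token with the query (OR semantics)."""
    return [name for name in names if index_tokens(name) & query_tokens]


names = ["Midfield", "Midland City", "Midway", "Minor", "Minturn", "Miami Beach"]

# The query 'mid' becomes {'mi', 'mid'}, and 'mi' is indexed for every name.
print(matches(names, search_tokens_edgengram("mid")))

# The query 'mid' stays {'mid'}, which only the Mid* names have indexed.
print(matches(names, search_tokens_plain("mid")))
```

With the edge n-gram analyzer on both sides, the two-letter gram "mi" is enough to match all six names. Keeping the query as the single token "mid" leaves only Midfield, Midland City and Midway.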

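    Update

    Below you will find the code that clarifies this part:

    1. I created a custom elasticsearch backend, adding a new custom analyzer based on the standard analyzer.
    2. I added a custom EdgeNgramField that allows setting one analyzer for indexing (index_analyzer) and another for searching (search_analyzer).

    In my app's search_backends.py: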

    from django.conf import settings
    from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend
    from haystack.backends.elasticsearch_backend import ElasticsearchSearchEngine
    from haystack.fields import EdgeNgramField as BaseEdgeNgramField
    
    
    # Custom Backend 
    class CustomElasticBackend(ElasticsearchSearchBackend):
    
        DEFAULT_ANALYZER = None
    
        def __init__(self, connection_alias, **connection_options):
            super(CustomElasticBackend, self).__init__(
                                    connection_alias, **connection_options)
            user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
            self.DEFAULT_ANALYZER = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', "snowball")
            if user_settings:
                setattr(self, 'DEFAULT_SETTINGS', user_settings)
    
        def build_schema(self, fields):
            content_field_name, mapping = super(CustomElasticBackend,
                                                  self).build_schema(fields)
    
            for field_name, field_class in fields.items():
                field_mapping = mapping[field_class.index_fieldname]
    
                index_analyzer = getattr(field_class, 'index_analyzer', None)
                search_analyzer = getattr(field_class, 'search_analyzer', None)
                field_analyzer = getattr(field_class, 'analyzer', self.DEFAULT_ANALYZER)
    
                if field_mapping['type'] == 'string' and field_class.indexed:
                    if not hasattr(field_class, 'facet_for') and field_class.field_type not in ('ngram', 'edge_ngram'):
                        field_mapping['analyzer'] = field_analyzer
    
                if index_analyzer and search_analyzer:
                    field_mapping['index_analyzer'] = index_analyzer
                    field_mapping['search_analyzer'] = search_analyzer
                    # pop() avoids a KeyError when no 'analyzer' key was set above
                    field_mapping.pop('analyzer', None)
    
                mapping.update({field_class.index_fieldname: field_mapping})
            return (content_field_name, mapping)
    
    
    class CustomElasticSearchEngine(ElasticsearchSearchEngine):
        backend = CustomElasticBackend
    
    
    # Custom field
    class CustomFieldMixin(object):
    
        def __init__(self, **kwargs):
            self.analyzer = kwargs.pop('analyzer', None)
            self.index_analyzer = kwargs.pop('index_analyzer', None)
            self.search_analyzer = kwargs.pop('search_analyzer', None)
            super(CustomFieldMixin, self).__init__(**kwargs)
    
    
    class CustomEdgeNgramField(CustomFieldMixin, BaseEdgeNgramField):
        pass
    

    My index definition is as follows:

    class MyIndex(indexes.SearchIndex, indexes.Indexable):
        text = indexes.CharField(document=True, use_template=True)
        name_auto = CustomEdgeNgramField(model_attr='name', index_analyzer="edgengram_analyzer", search_analyzer="suggest_analyzer")
    

    Finally, the settings of course point the haystack connection definition at the custom backend:

    HAYSTACK_CONNECTIONS = {
        'default': {
            'ENGINE': 'my_app.search_backends.CustomElasticSearchEngine',
            'URL': 'http://localhost:9200',
            'INDEX_NAME': 'index'
        },
    }
    

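Answer 1: (score: 1)

Well, I ran into a similar problem, and my strategy was to customize the backend.

The full explanation can be found at:

http://www.wellfireinteractive.com/blog/custom-haystack-elasticsearch-backend/

It worked for me!

Hope this helps.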