Can't match numbers with Haystack and Elasticsearch

Asked: 2014-12-04 05:28:03

Tags: elasticsearch django-haystack

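I have indexed some products with names like "99% chocolate". If I search for "chocolate", it matches this item, but if I search for "99", it does not. I came across "Using django haystack autocomplete with elasticsearch to search for digits/numbers?", which describes the same problem, but nobody answered it. Can anyone help?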

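Edit 2: Sorry, I left out an important detail. Searching for the number on its own works. It is the autocomplete that fails. Here are the relevant parts: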

#the relevant line in my index
    name_auto = indexes.EdgeNgramField(model_attr='name')

#the relevant line in my view
prodSqs = SearchQuerySet().models(Product).autocomplete(name_auto=request.GET.get('q', ''))

Edit: Here is the result of running the analyzer:

curl -XGET 'localhost:9200/haystack/_analyze?analyzer=standard&pretty' -d '99% chocolate'
{
  "tokens" : [ {
    "token" : "99",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<NUM>",
    "position" : 1
  }, {
    "token" : "chocolate",
    "start_offset" : 4,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

2 answers:

Answer 0: (score: 3)

Finally found the answer: ElasticSearch: EdgeNgrams and Numbers

Add the following classes, then change the ENGINE under HAYSTACK_CONNECTIONS in your settings file to use the CustomElasticsearchSearchEngine below instead of the default Haystack one:

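# Imports for the Haystack 2.x Elasticsearch backend that the classes below extend
from haystack.backends.elasticsearch_backend import (
    ElasticsearchSearchBackend,
    ElasticsearchSearchEngine,
)

class CustomElasticsearchBackend(ElasticsearchSearchBackend):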
    """
    The default ElasticsearchSearchBackend settings don't tokenize strings of digits the same way as words, so they
    get lost: the lowercase tokenizer is the culprit. Switching to the standard tokenizer and doing the case-
    insensitivity in the filter seems to do the job.
    """
    def __init__(self, connection_alias, **connection_options):
        # see https://stackoverflow.com/questions/13636419/elasticsearch-edgengrams-and-numbers
        self.DEFAULT_SETTINGS['settings']['analysis']['analyzer']['edgengram_analyzer']['tokenizer'] = 'standard'
        self.DEFAULT_SETTINGS['settings']['analysis']['analyzer']['edgengram_analyzer']['filter'].append('lowercase')
        super(CustomElasticsearchBackend, self).__init__(connection_alias, **connection_options)

class CustomElasticsearchSearchEngine(ElasticsearchSearchEngine):
    backend = CustomElasticsearchBackend
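
For reference, the settings change described above might look roughly like this. It is only a sketch: the dotted path `myapp.search_backends` is a hypothetical module name for wherever you put the two classes, and the URL and index name are taken from the curl command in the question.

```python
# settings.py (sketch)
HAYSTACK_CONNECTIONS = {
    "default": {
        # Hypothetical path: point this at the module that defines
        # CustomElasticsearchSearchEngine in your own project.
        "ENGINE": "myapp.search_backends.CustomElasticsearchSearchEngine",
        # Assumed from the question's curl call (localhost:9200, index "haystack").
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "haystack",
    },
}
```

Analyzer settings are applied when the index is created, so you will most likely need to rebuild the index (for example with `python manage.py rebuild_index`) before the change takes effect.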

Answer 1: (score: 0)

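Running the string 99% chocolate through the standard analyzer produces the correct result (99 is a term on its own), so if you are not currently using it, you should switch to it.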

curl -XGET 'localhost:9200/myindex/_analyze?analyzer=standard&pretty' -d '99% chocolate'
{
  "tokens" : [ {
    "token" : "99",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<NUM>",
    "position" : 1
  }, {
    "token" : "chocolate",
    "start_offset" : 4,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
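
To make the difference concrete for the autocomplete field, here is a rough pure-Python approximation of the two tokenizers followed by an edge n-gram step. It does not call Elasticsearch. The regular expressions only imitate the `lowercase` tokenizer (which splits on anything that is not a letter) and the `standard` tokenizer, and the `min_gram=2` / `max_gram=15` values are assumed to match Haystack's default edge n-gram filter.

```python
import re


def lowercase_tokenize(text):
    # Approximation of the "lowercase" tokenizer: keep runs of letters only,
    # so digits and punctuation act as separators and are discarded.
    return [tok.lower() for tok in re.findall(r"[^\W\d_]+", text)]


def standard_tokenize(text):
    # Very rough approximation of the "standard" tokenizer: keep runs of
    # letters and digits. Case is left alone, as in the real tokenizer.
    return re.findall(r"[^\W_]+", text)


def edge_ngrams(token, min_gram=2, max_gram=15):
    # Prefixes of the token, from min_gram up to max_gram characters long.
    # The gram sizes are an assumption about Haystack's defaults.
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def terms_with_lowercase_tokenizer(text):
    # Mimics the default edgengram_analyzer: lowercase tokenizer + edge n-grams.
    return [g for tok in lowercase_tokenize(text) for g in edge_ngrams(tok)]


def terms_with_standard_tokenizer(text):
    # Mimics the modified analyzer: standard tokenizer, lowercase filter,
    # then edge n-grams.
    return [g for tok in standard_tokenize(text) for g in edge_ngrams(tok.lower())]


if __name__ == "__main__":
    name = "99% Chocolate"
    print(terms_with_lowercase_tokenizer(name))
    print(terms_with_standard_tokenizer(name))
```

Under these assumptions, "99" never reaches the index with the letter-only tokenizer, which would explain why autocomplete on "99" finds nothing while a plain search (analyzed with the standard analyzer) does.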