Spelling suggestions are returning stemmed words in Elasticsearch

Date: 2014-04-25 17:59:01

Tags: elasticsearch django-haystack spelling

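I'm pretty sure this has to do with stemming, but I'm not sure what I need to change to get the spelling suggestions to return whole words.

The settings are: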

ELASTICSEARCH_INDEX_SETTINGS = {
  'settings': {
    "analysis": {
        "analyzer": {
            "default": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["standard", "lowercase", "stop_words", "cm_snow"]
            },
            "ngram_analyzer": {
                "type": "custom",
                "tokenizer": "lowercase",
                "filter": ["haystack_ngram"]
            },
            "edgengram_analyzer": {
                "type": "custom",
                "tokenizer": "lowercase",
                "filter": ["haystack_edgengram"]
            }
        },
        "tokenizer": {
            "haystack_ngram_tokenizer": {
                "type": "nGram",
                "min_gram": 3,
                "max_gram": 15,
            },
            "haystack_edgengram_tokenizer": {
                "type": "edgeNGram",
                "min_gram": 2,
                "max_gram": 15,
                "side": "front"
            }
        },
        "filter": {
            "haystack_ngram": {
                "type": "nGram",
                "min_gram": 3,
                "max_gram": 15
            },
            "haystack_edgengram": {
                "type": "edgeNGram",
                "min_gram": 2,
                "max_gram": 15
            },
            "cm_snow": {
                "type": "snowball",
                "language": "English"
            },
            "stop_words": {
                "type": "stop",
                "ignore_case": True,
                "stopwords": STOP_WORDS
            }
        }
    }
  }
}

If I run the following query against Elasticsearch:

curl -XPOST 'localhost:9200/listing/_suggest' -d '{
  "my-suggestion" : {
    "text" : "table",
    "term" : {
      "field" : "text"
    }
  }
}'

I get back:

{"text":"tabl","offset":0,"length":5,"options":[]}

Why is the result "tabl", even for a correctly spelled word?

1 Answer:

Answer 0 (score: 3)

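The problem was that I was using the default analyzer for the field. The default analyzer includes the snowball filter, and it is also the index_analyzer, so the words are indexed as their stems rather than as whole words.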
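To make the effect concrete, here is a rough Python simulation of why a correctly spelled word comes back as a stem with no options. It is only a sketch: `toy_stem` is a crude stand-in for a real stemmer and does not reproduce Snowball's rules, `suggest_missing` is a hypothetical helper that imitates a term suggester which only offers options for terms missing from the index, and `difflib` is used purely as a substitute for Elasticsearch's own candidate scoring.

```python
import difflib


def toy_stem(word):
    """Crude illustrative stemmer (NOT Snowball): strips a few suffixes."""
    word = word.lower()
    for suffix in ("es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(words, analyze):
    """Return the set of terms an index would hold after analysis."""
    return {analyze(w) for w in words}


def suggest_missing(text, index_terms, analyze):
    """Imitate a term suggester that only suggests for missing terms.

    The input is analyzed the same way as the indexed field, so the
    reported text is the analyzed token, not what the user typed.
    """
    token = analyze(text)
    if token in index_terms:
        # The analyzed token already exists in the index: nothing to suggest.
        return {"text": token, "options": []}
    candidates = difflib.get_close_matches(token, sorted(index_terms), n=3, cutoff=0.6)
    return {"text": token, "options": candidates}


words = ["table", "tables", "chair"]

# Stemmed field: "table" is analyzed to "tabl", which is already indexed.
stemmed_index = build_index(words, toy_stem)
stemmed_result = suggest_missing("table", stemmed_index, toy_stem)

# Unstemmed field: whole words are indexed, so a typo gets whole-word options.
plain_index = build_index(words, str.lower)
plain_result = suggest_missing("tabls", plain_index, str.lower)
```

With the stemmed index the analyzed token already exists, so the option list is empty, which is the same shape as the response in the question. With the unstemmed index the options are whole words.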

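Because we still want to search on stems, I added an extra field to the document mapping, called suggest, that uses the standard analyzer. Into it I put a blob of text made up of a bunch of the document's words (title, description, tags), and marked it include_in_all=false. Here is its mapping: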

"suggest": {
    "type": "string",
    "analyzer": "standard"
},
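As a sketch of how that blob might be assembled before indexing, the Python below joins the unstemmed source text into one string. The function name `build_suggest_text` and the sample document fields are hypothetical, chosen only for illustration; how the value is actually attached to the document depends on your indexing code.

```python
def build_suggest_text(title, description, tags):
    """Join the raw (unstemmed) text into one blob for the suggest field."""
    parts = [title, description] + list(tags)
    # Skip empty values and trim stray whitespace around each part.
    return " ".join(p.strip() for p in parts if p and p.strip())


# Hypothetical document, for illustration only.
doc = {
    "title": "Oak table",
    "description": "Solid oak dining table",
    "tags": ["furniture", "wood"],
}
doc["suggest"] = build_suggest_text(doc["title"], doc["description"], doc["tags"])
```

Because the suggest field is analyzed with the standard analyzer, the words in this blob are indexed whole, while the other fields keep the stemming analyzer for search.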

Then in my query, I search against _all for the actual search results, but use the suggest field for the suggestions.

{
  "query": {
     "match": {
         "_all": "tabel"
     }
  },
  "suggest": {
    "suggest-0": {
      "term": {
        "field": "suggest",
        "size": 5
      },
      "text": "tabls"
    }
  }
}

Which gives:

{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    },
    "suggest": {
        "suggest-0": [
            {
                "text": "tabls",
                "offset": 0,
                "length": 5,
                "options": [
                    {
                        "text": "table",
                        "score": 0.8,
                        "freq": 858
                    },
                    {
                        "text": "tables",
                        "score": 0.8,
                        "freq": 682
                    },
                    {
                        "text": "tails",
                        "score": 0.8,
                        "freq": 4
                    },
                    {
                        "text": "tabs",
                        "score": 0.75,
                        "freq": 4
                    },
                    {
                        "text": "tools",
                        "score": 0.6,
                        "freq": 176
                    }
                ]
            }
        ]
    }
}

My UI code then knows to offer the suggestions to the user so they can run a better search.
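A minimal Python sketch of that last step, assuming the response has already been decoded into a dict shaped like the one above. The helper names `build_search_body`, `needs_suggestion`, and `extract_suggestions` are made up for this example, and `sample_response` is a trimmed copy of the response shown earlier.

```python
def build_search_body(text, suggest_field="suggest", size=5):
    """Build a request body: search on _all, suggest on the unstemmed field."""
    return {
        "query": {"match": {"_all": text}},
        "suggest": {
            "suggest-0": {
                "text": text,
                "term": {"field": suggest_field, "size": size},
            }
        },
    }


def needs_suggestion(response):
    """Offer suggestions only when the search itself found nothing."""
    return response["hits"]["total"] == 0


def extract_suggestions(response, name="suggest-0"):
    """Map each input token to its suggested words, best first."""
    results = {}
    for entry in response.get("suggest", {}).get(name, []):
        # Order by score, then by how often the word occurs in the index.
        options = sorted(entry["options"], key=lambda o: (-o["score"], -o["freq"]))
        results[entry["text"]] = [o["text"] for o in options]
    return results


# Trimmed copy of the response shown above.
sample_response = {
    "hits": {"total": 0, "max_score": None, "hits": []},
    "suggest": {
        "suggest-0": [
            {
                "text": "tabls",
                "offset": 0,
                "length": 5,
                "options": [
                    {"text": "table", "score": 0.8, "freq": 858},
                    {"text": "tables", "score": 0.8, "freq": 682},
                    {"text": "tails", "score": 0.8, "freq": 4},
                    {"text": "tabs", "score": 0.75, "freq": 4},
                    {"text": "tools", "score": 0.6, "freq": 176},
                ],
            }
        ]
    },
}

body = build_search_body("tabls")
suggestions = extract_suggestions(sample_response) if needs_suggestion(sample_response) else {}
```

The UI can then render the words in `suggestions["tabls"]` as "did you mean" links that rerun the search.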