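Google-style autocomplete & autocorrect with Elasticsearch

Date: 2016-06-06 04:16:58

Tags: elasticsearch

I'm trying to implement Google-style autocomplete and autocorrect with Elasticsearch.

Mapping: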

POST music
{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "song": {
      "properties": {
        "song_field": {
          "type": "string",
          "analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        },
        "suggest": {
          "type": "completion",
          "analyzer": "simple",
          "search_analyzer": "simple",
          "payloads": true
        }
      }
    }
  }
}

Documents:

POST music/song
{
  "song_field" : "beautiful queen",
  "suggest" : "beautiful queen"
}

POST music/song
{
  "song_field" : "beautiful",
  "suggest" : "beautiful"
}

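When the user types "beaatiful q", I want them to get something like "beautiful queen" ("beaatiful" is corrected to "beautiful" and "q" is completed to "queen").

I tried the following query: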

POST music/song/_search?search_type=dfs_query_then_fetch
{
  "size": 10,
  "suggest": {
    "didYouMean": {
      "text": "beaatiful q",
      "completion": {
        "field": "suggest"
      }
    }
  },
  "query": {
    "match": {
      "song_field": {
        "query": "beaatiful q",
        "fuzziness": 2
      }
    }
  }
}

Unfortunately, the completion suggester doesn't tolerate any typos, so I get this response:

"suggest": {
    "didYouMean": [
      {
        "text": "beaatiful q",
        "offset": 0,
        "length": 11,
        "options": []
      }
    ]
  }

In addition, the search gives me these results ("beautiful" ranks higher even though the user has started typing "queen"):

"hits": [
      {
        "_index": "music",
        "_type": "song",
        "_id": "AVUj4Y5NancUpEdFLeLo",
        "_score": 0.51315063,
        "_source": {
          "song_field": "beautiful",
          "suggest": "beautiful"
        }
      },
      {
        "_index": "music",
        "_type": "song",
        "_id": "AVUj4XFAancUpEdFLeLn",
        "_score": 0.32071912,
        "_source": {
          "song_field": "beautiful queen",
          "suggest": "beautiful queen"
        }
      }
    ]

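Update!!!

I found that I can use fuzzy matching with the completion suggester, but now I get no suggestions at all when querying (fuzzy supports an edit distance of at most 2):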

POST music/song/_search
{
  "size": 10,
  "suggest": {
    "didYouMean": {
      "text": "beaatefal q",
      "completion": {
        "field": "suggest",
        "fuzzy": {
          "fuzziness": 2
        }
      }
    }
  }
}

I still expect "beautiful queen" as the suggestion in the response.
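
One possible explanation for the empty result is simply the edit distance of the test input. The sketch below counts plain Levenshtein edits (insert, delete, substitute) between the typed word and the indexed word. It is only an illustration: `edit_distance` is a made-up helper name, and the suggester's own fuzzy matching is implemented differently inside Elasticsearch, so treat the numbers as a sanity check rather than a reproduction of its behaviour.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings (illustrative helper)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


print(edit_distance("beaatiful", "beautiful"))  # 1
print(edit_distance("beaatefal", "beautiful"))  # 3
```

"beaatiful" is one edit away from "beautiful", but "beaatefal" is three edits away, which is beyond the maximum fuzziness of 2 mentioned above.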

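1 Answer:

Answer 0: (score: 1)

When you want to offer two or more words as a search suggestion, I found out (the hard way) that using ngrams or edge ngrams in Elasticsearch isn't worth it.

Using the shingle token filter and a shingle analyzer will give you multi-word phrases, and if you combine that with match_phrase_prefix it should give you the functionality you're looking for.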
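
As a rough sketch of the query side, the snippet below builds a `match_phrase_prefix` request body in Python and prints it as JSON. The helper name `phrase_prefix_query` and the field name `song_title` are hypothetical; substitute whichever field you analyze into whole words. `max_expansions` is an existing option of this query type that limits how many terms the last, partial word may expand to. As far as I know, this query completes the trailing partial word but does not correct typos in the earlier words on its own.

```python
import json


def phrase_prefix_query(text, field="song_title", max_expansions=10):
    """Build a match_phrase_prefix search body (field name is an assumption)."""
    return {
        "query": {
            "match_phrase_prefix": {
                field: {
                    "query": text,
                    # Cap how many terms the final prefix may expand to.
                    "max_expansions": max_expansions,
                }
            }
        }
    }


# The printed JSON can be sent as the body of a _search request.
print(json.dumps(phrase_prefix_query("beautiful q"), indent=2))
```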

Don't forget to do the mapping:

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,  
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2, 
                    "max_shingle_size": 2, 
                    "output_unigrams":  false   
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":             "custom",
                    "tokenizer":        "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter" 
                    ]
                }
            }
        }
    }
}

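Ngrams and edge ngrams tokenize at the level of individual characters, whereas the shingle analyzer and filter work on whole words, grouping them into phrases, and give you a more efficient way to generate and search phrases. I spent a lot of time messing around with the first two until I saw shingles mentioned and read up on them. Much better.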
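
To make the difference concrete, here is a small pure-Python approximation of the two approaches. It does not call Elasticsearch: `edge_ngrams` and `shingles` are illustrative helpers, whitespace splitting stands in for the standard tokenizer, and the shingle size of 2 mirrors `min_shingle_size` and `max_shingle_size` in the settings above with `output_unigrams` set to false.

```python
def edge_ngrams(word, min_gram=2, max_gram=20):
    """Character prefixes of a single word, similar in spirit to edge n-grams."""
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


def shingles(text, size=2):
    """Word-level n-grams: every run of `size` consecutive lowercased words."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


print(edge_ngrams("queen"))
# ['qu', 'que', 'quee', 'queen']
print(shingles("Beautiful queen of hearts"))
# ['beautiful queen', 'queen of', 'of hearts']
```

The edge n-grams are fragments of one word, while each shingle is a pair of adjacent whole words, which is why shingles are the better fit for suggesting phrases.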