Question

I am trying to implement Google-style autocomplete & autocorrect with Elasticsearch.

Mapping:

POST music
{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "song": {
      "properties": {
        "song_field": {
          "type": "string",
          "analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        },
        "suggest": {
          "type": "completion",
          "analyzer": "simple",
          "search_analyzer": "simple",
          "payloads": true
        }
      }
    }
  }
}
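
To make the mapping above concrete, here is a rough Python approximation of the terms `nGram_analyzer` would index for "beautiful queen". It is only a sketch under assumptions: `ngrams` and `ngram_analyze` are hypothetical helper names, `asciifolding` is skipped, and the real Lucene filter may emit tokens in a different order.

```python
def ngrams(token, min_gram=2, max_gram=20):
    """Return every substring of `token` whose length is in [min_gram, max_gram]."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break  # longer grams would run past the end of the token
            grams.append(token[start:start + size])
    return grams


def ngram_analyze(text):
    """Approximate the analyzer: whitespace split, lowercase, then n-grams per token."""
    tokens = text.lower().split()
    return [gram for token in tokens for gram in ngrams(token)]


index_terms = ngram_analyze("beautiful queen")
query_terms = "beaatiful q".lower().split()  # search side: no n-grams applied

print(len(index_terms))       # number of grams indexed for the document
print("q" in index_terms)     # is the one-letter query term among them?
```

In this approximation the lone `q` from the query is shorter than `min_gram`, so it is not among the indexed grams, which may be part of why it contributes little to ranking.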

Documents:

POST music/song
{
  "song_field" : "beautiful queen",
  "suggest" : "beautiful queen"
}

POST music/song
{
  "song_field" : "beautiful",
  "suggest" : "beautiful"
}

I want the user, upon typing "beaatiful q", to get something like "beautiful queen" (beaatiful is corrected to beautiful and q is completed to queen).

I tried the following query:

POST music/song/_search?search_type=dfs_query_then_fetch
{
  "size": 10,
  "suggest": {
    "didYouMean": {
      "text": "beaatiful q",
      "completion": {
        "field": "suggest"
      }
    }
  },
  "query": {
    "match": {
      "song_field": {
        "query": "beaatiful q",
        "fuzziness": 2
      }
    }
  }
}

Unfortunately, the Completion suggester does not tolerate any typos, so I get this response:

"suggest": {
    "didYouMean": [
      {
        "text": "beaatiful q",
        "offset": 0,
        "length": 11,
        "options": []
      }
    ]
  }

In addition, the search gave me these results (beautiful ranks higher even though the user started typing "queen"):

"hits": [
      {
        "_index": "music",
        "_type": "song",
        "_id": "AVUj4Y5NancUpEdFLeLo",
        "_score": 0.51315063,
        "_source": {
          "song_field": "beautiful",
          "suggest": "beautiful"
        }
      },
      {
        "_index": "music",
        "_type": "song",
        "_id": "AVUj4XFAancUpEdFLeLn",
        "_score": 0.32071912,
        "_source": {
          "song_field": "beautiful queen",
          "suggest": "beautiful queen"
        }
      }
    ]

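UPDATE!!!

I found that I can use fuzzy matching with the completion suggester, but now I get no suggestions at all for the query below (fuzzy only supports an edit distance of up to 2):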

POST music/song/_search
{
  "size": 10,
  "suggest": {
    "didYouMean": {
      "text": "beaatefal q",
      "completion": {
        "field": "suggest",
        "fuzzy": {
          "fuzziness": 2
        }
      }
    }
  }
}

I still expect "beautiful queen" as the suggested response.
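
One plausible reason the second attempt returns nothing is that "beaatefal" is more than two edits away from "beautiful". The sketch below checks this with a plain Levenshtein distance; `edit_distance` is a hypothetical helper, and Elasticsearch's own fuzzy matching has extra options (such as transpositions and prefix length) that this does not model.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


print(edit_distance("beaatiful", "beautiful"))  # 1
print(edit_distance("beaatefal", "beautiful"))  # 3
```

So the first typo is within a fuzziness of 2, while the second needs three substitutions and falls outside it.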

Answer 1

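When you want to offer 2 or more words as search suggestions, I found out (the hard way) that using ngrams or edgengrams in Elasticsearch is not worth it.

Using the Shingles token filter and a shingles analyzer will give you multi-word phrases, and if you combine that with match_phrase_prefix, it should give you the functionality you are looking for.

Basically something like this: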

this
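
As a rough illustration of the match_phrase_prefix part, the snippet below builds a request body in Python. It is a sketch under assumptions: `phrase_prefix_body` is a hypothetical helper, the field name `song_field` is borrowed from the question, and the `max_expansions` value is arbitrary. Note that match_phrase_prefix completes the last word but does not correct typos on its own.

```python
import json


def phrase_prefix_body(field, text, max_expansions=10):
    """Build a search body whose last term is treated as a prefix."""
    return {
        "query": {
            "match_phrase_prefix": {
                field: {
                    "query": text,
                    "max_expansions": max_expansions,  # cap on prefix expansions
                }
            }
        }
    }


body = phrase_prefix_body("song_field", "beautiful q")
print(json.dumps(body, indent=2))
```

The printed JSON could then be sent to the index's `_search` endpoint with whatever client you already use.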

Don't forget to set up the mapping:

PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "my_shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_shingle_filter"
          ]
        }
      }
    }
  }
}

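Ngrams and edgengrams break text into character-level fragments, whereas the Shingles analyzer and filter work on whole words and group adjacent ones into phrases, which is a far more efficient way to produce and search for phrases. I spent a lot of time messing around with the first two until I saw Shingles mentioned and read up on it. Much better.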
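
To show the difference in token shape, here is a small Python approximation of a two-word shingle filter like `my_shingle_filter`. It is a sketch only: `shingles` is a hypothetical helper, and a lowercase whitespace split stands in for the standard tokenizer.

```python
def shingles(text, size=2):
    """Return word-level shingles: each run of `size` adjacent lowercase words."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


print(shingles("Beautiful Queen of Hearts"))
# ['beautiful queen', 'queen of', 'of hearts']
print(shingles("beautiful"))
# []
```

In this sketch a single-word input yields no shingles at all, so one-word titles would need unigrams enabled or a separate field to stay searchable.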
