Question

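I want to accept a query like "jan do" and have it match values like "jane doe", "don janek", and of course "jan do" and "do jan".

So the rules I can think of right now are:

Tokenize the query on non-alphanumeric characters (e.g. whitespace, symbols, punctuation)
Each query token acts as a prefix that is matched against the tokens in the data store
The order in which the tokens appear does not matter, e.g. "jan do" should match "do jan"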
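
The three rules above can be written down as a small reference check. This is only a sketch in plain Python, not Elasticsearch code: the function names `tokenize` and `matches` are made up for illustration, and lowercasing both sides is an added assumption.

```python
import re

def tokenize(text):
    """Split on runs of non-alphanumeric characters; lowercasing is an assumption."""
    return [t for t in re.split(r"[\W_]+", text.lower()) if t]

def matches(query, value):
    """True if every query token is a prefix of some token in value, in any order."""
    value_tokens = tokenize(value)
    return all(
        any(v.startswith(q) for v in value_tokens)
        for q in tokenize(query)
    )

for candidate in ["jane doe", "don janek", "jan do", "do jan", "john smith"]:
    print(candidate, matches("jan do", candidate))
```

Note that this sketch does not require distinct query tokens to match distinct stored tokens, so "do do" would also match "doe".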

So far I have this mapping:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "question": {
      "properties": {
        "title": {
          "type": "string"
        },
        "answer": {
          "type": "object",
          "properties": {
            "text": {
              "type": "string",
              "analyzer": "my_keyword",
              "fields": {
                "stemmed": {
                  "type": "string",
                  "analyzer": "standard"
                }
              }
            }
          }
        }
      }
    }
  }
}

I have been searching with a phrase query:

POST /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.2,
      "queries": [
        {
          "match": {
            "answer.text": {
              "query": "jan do",
              "type": "phrase_prefix"
            }
          }
        },
        {
          "match": {
            "answer.text.stemmed": {
              "query": "jan do",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}

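This works when the text actually starts with the phrase, but now I want to tokenize the query and treat each token as a prefix.

Is there a way I can do this (possibly at query time)?

My other option is to just build a query like this: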

POST test/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "prefix": {
            "answer.text.stemmed": "jan"
          }
        },
        {
          "prefix": {
            "answer.text.stemmed": "do"
          }
        }
      ]
    }
  }
}

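This seems to work, but it does not preserve the order of the words. Also, it feels like cheating and is probably not the most performant option. What if there are 10 prefixes? 100? I would like to know if anyone thinks otherwise.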
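
For any number of tokens, a query of this shape could be generated rather than written by hand. A minimal sketch, assuming the same field as above; `build_prefix_query` and its `occur` parameter are hypothetical names, and the tokenization is a simplification of what an analyzer would do.

```python
import json
import re

def build_prefix_query(query, field="answer.text.stemmed", occur="should"):
    """Return a bool query body with one prefix clause per query token."""
    tokens = [t for t in re.split(r"[\W_]+", query.lower()) if t]
    return {
        "query": {
            "bool": {
                occur: [{"prefix": {field: token}} for token in tokens]
            }
        }
    }

print(json.dumps(build_prefix_query("jan do"), indent=2))
```

With only `should` clauses, a document matching just one of the prefixes is still returned; passing `occur="must"` would require every prefix to match.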

Answer 1

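As suggested in the comments above, you should take a look at ngrams in Elasticsearch, and edge ngrams in particular.

I wrote an introduction to using ngrams in this blog post for Qbox, but here is a simple example you can play with.

Here is an index definition that applies an edge ngram token filter, along with a few other filters, to a custom analyzer (which uses the standard tokenizer).

The way analyzers are applied changed a little in ES 2.0. Note, though, that I am using the standard analyzer for the "search_analyzer". That is because I do not want the search text to be tokenized into ngrams; I want it to be compared directly against the indexed tokens. I will refer you to the blog post for the details.

Anyway, here is the mapping: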

PUT /test_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "autocomplete": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "standard",
                  "stop",
                  "kstem",
                  "edgengram_filter"
               ]
            }
         },
         "filter": {
            "edgengram_filter": {
               "type": "edgeNGram",
               "min_gram": 2,
               "max_gram": 15
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "name": {
               "type": "string",
               "analyzer": "autocomplete",
               "search_analyzer": "standard"
            },
            "price":{
                "type": "integer"
            }
         }
      }
   }
}
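
To make it concrete what an edge ngram filter with `min_gram` 2 and `max_gram` 15 emits, here is a rough Python sketch. The `edge_ngrams` helper is hypothetical and only mimics the filter; it skips the `stop` and `kstem` filters, so the real analyzer would stem "shoes" to "shoe" before the ngrams are produced.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the leading substrings of token with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    # Tokens shorter than min_gram produce no ngrams in this sketch.
    return [token[:n] for n in range(min_gram, upper + 1)]

for word in "very cool shoes".split():
    print(word, edge_ngrams(word))
```

Because these prefixes are stored at index time, a search token such as "ver" only needs to equal one of them, which is why the search text itself is not turned into ngrams.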

Then I index a few simple documents:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"name": "very cool shoes","price": 26}
{"index":{"_id":2}}
{"name": "great shampoo","price": 15}
{"index":{"_id":3}}
{"name": "shirt","price": 25}

Now the following query gives me the expected autocomplete results:

POST /test_index/_search
{
   "query": {
      "match": {
         "name": {
            "query": "ver sh",
            "operator": "and"
         }
      }
   }
}
...
{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.2169777,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.2169777,
            "_source": {
               "name": "very cool shoes",
               "price": 26
            }
         }
      ]
   }
}
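
The reason only document 1 comes back can be simulated with a toy inverted index. This is a simplified sketch, not how Elasticsearch is implemented: `build_index` and `search_and` are hypothetical names, tokens are split on whitespace only, and the `stop` and `kstem` filters are left out.

```python
from collections import defaultdict

DOCS = {1: "very cool shoes", 2: "great shampoo", 3: "shirt"}

def index_terms(text, min_gram=2, max_gram=15):
    """Index-time analysis: split on whitespace, then emit edge ngrams per token."""
    terms = set()
    for token in text.split():
        for n in range(min_gram, min(len(token), max_gram) + 1):
            terms.add(token[:n])
    return terms

def build_index(docs):
    """Map each indexed term to the set of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            inverted[term].add(doc_id)
    return inverted

def search_and(inverted, query):
    """Search-time analysis: no ngrams; every query token must be an indexed term."""
    tokens = query.lower().split()
    if not tokens:
        return []
    hits = [inverted.get(token, set()) for token in tokens]
    return sorted(set.intersection(*hits))

inverted = build_index(DOCS)
print(search_and(inverted, "ver sh"))  # [1]
print(search_and(inverted, "sh"))      # [1, 2, 3]
```

All three names contain a token starting with "sh", but only "very cool shoes" also contains one starting with "ver", so the "and" operator narrows the result to that document regardless of the order of the query tokens.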

Here is all the code I used in the example:

http://sense.qbox.io/gist/c2ba05900d0749fa3b1ba516c66431ae1a9d5e61

How to match multiple words as token prefixes
