Exact document matching with Elasticsearch

Date: 2013-03-21 12:19:52

Tags: lucene elasticsearch

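I need to query exactly against a set of "short documents". For example:

Documents:

  1. {"name": "John Doe", "alt": "John W Doe"}
  2. {"name": "My friend John Doe", "alt": "John A Doe"}
  3. {"name": "John", "alt": "Susy"}
  4. {"name": "Jack", "alt": "John Doe"}

Expected results:

  1. If I search for "John Doe", I would like the score of 1 to be far greater than the scores of 2 and 4.
  2. If I search for "John Doé", same as above.
  3. If I search for "John", I would like to get 3 (an exact match is better than repeated occurrences in name and alt).

Is this possible with ES? How can I achieve it? I tried boosting "name", but I can't find how to match a document field exactly, rather than searching within it.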

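2 answers:

Answer 0 (score: 5)

What you're describing is exactly how a search engine works by default. A search for "John Doe" becomes a search for the two terms "john" and "doe". For each term, it looks for documents that contain the term, then assigns a _score to each document based on: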

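  • how common the term is across all documents (more common == less relevant)
  • how common the term is within the document's field (more common == more relevant)
  • how long the document's field is (longer == less relevant)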
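To make the three factors concrete, here is a minimal Python sketch modelled loosely on Lucene's classic TF/IDF formula. It is an illustration, not Elasticsearch's actual implementation: it leaves out the coordination and query-normalization factors and the lossy encoding of field norms, so the absolute numbers differ from a real _score. The helper names (tokenize, score, rank) are made up for this example.

```python
import math
from collections import Counter

# The "name" values from the question, keyed by document id.
DOCS = {
    1: "John Doe",
    2: "My friend John Doe",
    3: "John",
    4: "Jack",
}

def tokenize(text):
    # Stand-in for analysis: lowercase and split on whitespace.
    return text.lower().split()

def score(query, doc_id, docs=DOCS):
    terms = tokenize(docs[doc_id])
    counts = Counter(terms)
    n_docs = len(docs)
    total = 0.0
    for term in tokenize(query):
        freq = counts[term]
        if freq == 0:
            continue
        doc_freq = sum(1 for text in docs.values() if term in tokenize(text))
        idf = 1.0 + math.log(n_docs / (doc_freq + 1))  # rarer term -> higher weight
        tf = math.sqrt(freq)                            # more occurrences -> higher
        norm = 1.0 / math.sqrt(len(terms))              # longer field -> lower
        total += tf * idf * idf * norm
    return total

def rank(query, docs=DOCS):
    # Document ids with a non-zero score, best match first.
    scored = [(doc_id, score(query, doc_id, docs)) for doc_id in docs]
    return [doc_id for doc_id, s in sorted(scored, key=lambda p: -p[1]) if s > 0]

print(rank("john doe"))
print(rank("john"))
```

Even in this simplified form the field-length factor is what puts the one-word name "John" ahead of the longer names when the query is just "john".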

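The reason you're not seeing clear results is that Elasticsearch is distributed and you're testing with a small amount of data. By default (in the Elasticsearch versions current when this answer was written), an index has 5 primary shards, and your documents are indexed on different shards. Each shard keeps its own document frequency counts, so the scores are distorted.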
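The distortion is easy to reproduce on paper. The sketch below assumes a hypothetical placement of the four documents on two shards (real routing depends on the document id) and uses the same simplified idf formula, 1 + ln(N / (df + 1)), as an illustration; the function names are invented for the example.

```python
import math

def idf(n_docs, doc_freq):
    # Simplified inverse document frequency: rarer terms get a higher weight.
    return 1.0 + math.log(n_docs / (doc_freq + 1))

def doc_freq(term, names):
    # Number of documents whose name contains the term.
    return sum(1 for name in names if term in name.lower().split())

# Hypothetical shard assignment, for illustration only.
shard_a = ["John Doe", "Jack"]
shard_b = ["My friend John Doe", "John"]
all_docs = shard_a + shard_b

# Each shard only sees its own documents when it computes frequencies.
local_a = idf(len(shard_a), doc_freq("john", shard_a))
local_b = idf(len(shard_b), doc_freq("john", shard_b))

# With global frequencies every document is scored with the same weight.
global_idf = idf(len(all_docs), doc_freq("john", all_docs))

print(local_a, local_b, global_idf)
```

With so few documents, the same term gets a noticeably different weight depending on which shard a document landed on, which is enough to reorder the results.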

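When you add real-world amounts of data, the frequencies even themselves out across shards. For testing with small amounts of data, though, you need to do one of two things:

  1. Create the index with only one primary shard, or
  2. Specify search_type=dfs_query_then_fetch, which first fetches the frequencies from every shard and then runs the query using the global frequencies.

To demonstrate, first index your data: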

    curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
    {
       "alt" : "John W Doe",
       "name" : "John Doe"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1'  -d '
    {
       "alt" : "John A Doe",
       "name" : "My friend John Doe"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1'  -d '
    {
       "alt" : "Susy",
       "name" : "John"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1'  -d '
    {
       "alt" : "John Doe",
       "name" : "Jack"
    }
    '
    

Now search for "john doe", remembering to specify dfs_query_then_fetch:

    curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
    {
       "query" : {
          "match" : {
             "name" : "john doe"
          }
       }
    }
    '
    

Doc 1 is first in the results:

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "alt" : "John W Doe",
    #                "name" : "John Doe"
    #             },
    #             "_score" : 1.0189849,
    #             "_index" : "test",
    #             "_id" : "1",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "John A Doe",
    #                "name" : "My friend John Doe"
    #             },
    #             "_score" : 0.81518793,
    #             "_index" : "test",
    #             "_id" : "2",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "Susy",
    #                "name" : "John"
    #             },
    #             "_score" : 0.3066778,
    #             "_index" : "test",
    #             "_id" : "3",
    #             "_type" : "test"
    #          }
    #       ],
    #       "max_score" : 1.0189849,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 8
    # }
    

And when you search for just "john":

    curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
    {
       "query" : {
          "match" : {
             "name" : "john"
          }
       }
    }
    '
    

Doc 3 appears first:

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "alt" : "Susy",
    #                "name" : "John"
    #             },
    #             "_score" : 1,
    #             "_index" : "test",
    #             "_id" : "3",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "John W Doe",
    #                "name" : "John Doe"
    #             },
    #             "_score" : 0.625,
    #             "_index" : "test",
    #             "_id" : "1",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "John A Doe",
    #                "name" : "My friend John Doe"
    #             },
    #             "_score" : 0.5,
    #             "_index" : "test",
    #             "_id" : "2",
    #             "_type" : "test"
    #          }
    #       ],
    #       "max_score" : 1,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 5
    # }
    

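Ignoring accents

The second issue is matching "John Doé". This is a question of analysis. To make full text more searchable, we analyze it into separate terms, or tokens, and these are what is stored in the index. So that John, JOHN and john all match when the user searches for john, each term/token is passed through a number of token filters that put it into a standard form.

When we run a full-text search, the search terms go through exactly the same process. So if we have a document that contains John, it is indexed as john, and if the user searches for JOHN, we actually search for john.

To make Doé match doe, we need a token filter that removes accents, and we need to apply it both to the text being indexed and to the search terms. The easiest way to do this is to use the ASCII folding token filter.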
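As a rough picture of what such an analysis chain does, here is a small Python approximation: split into word tokens, lowercase, then strip accents. It only mimics the idea. The real asciifolding filter handles many more characters than Unicode decomposition covers, and the function names below are made up for the example.

```python
import re
import unicodedata

def fold_accents(token):
    # Decompose accented characters (é -> e + combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(text):
    # Tokenize on word characters, lowercase, then remove accents.
    tokens = re.findall(r"\w+", text)
    return [fold_accents(tok.lower()) for tok in tokens]

# The indexed text and the search text go through the same function,
# so both end up as the same terms.
print(analyze("John Doe"))
print(analyze("john doé"))
```

Because the same function is applied on both sides, the query "john doé" and the indexed value "John Doe" produce identical terms.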

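We can define a custom analyzer when we create the index, and we can specify in the mapping that a particular field should use that analyzer at both index time and search time.

First, delete the old index: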

    curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1' 
    

Then create the index, specifying the custom analyzer and the mapping:

    curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
    {
       "settings" : {
          "analysis" : {
             "analyzer" : {
                "no_accents" : {
                   "filter" : [
                      "standard",
                      "lowercase",
                      "asciifolding"
                   ],
                   "type" : "custom",
                   "tokenizer" : "standard"
                }
             }
          }
       },
       "mappings" : {
          "test" : {
             "properties" : {
                "name" : {
                   "type" : "string",
                   "analyzer" : "no_accents"
                }
             }
          }
       }
    }
    '
    

Reindex the data:

    curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
    {
       "alt" : "John W Doe",
       "name" : "John Doe"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1'  -d '
    {
       "alt" : "John A Doe",
       "name" : "My friend John Doe"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1'  -d '
    {
       "alt" : "Susy",
       "name" : "John"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1'  -d '
    {
       "alt" : "John Doe",
       "name" : "Jack"
    }
    '
    

Now, test the search:

    curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
    {
       "query" : {
          "match" : {
             "name" : "john doé"
          }
       }
    }
    '
    
    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "alt" : "John W Doe",
    #                "name" : "John Doe"
    #             },
    #             "_score" : 1.0189849,
    #             "_index" : "test",
    #             "_id" : "1",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "John A Doe",
    #                "name" : "My friend John Doe"
    #             },
    #             "_score" : 0.81518793,
    #             "_index" : "test",
    #             "_id" : "2",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "Susy",
    #                "name" : "John"
    #             },
    #             "_score" : 0.3066778,
    #             "_index" : "test",
    #             "_id" : "3",
    #             "_type" : "test"
    #          }
    #       ],
    #       "max_score" : 1.0189849,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 6
    # }
    

Answer 1 (score: 2)

I think you'll get what you need if you map the field as a multi_field and boost the non-analyzed sub-field:

"name": {
    "type": "multi_field",
    "fields": {
        "untouched": {
            "type": "string",
            "index": "not_analyzed",
            "boost": "1.1"
        },
        "name": {
            "include_in_all": true,
            "type": "string",
            "index": "analyzed",
            "search_analyzer": "someanalyzer",
            "index_analyzer": "someanalyzer"
        }
    }
}
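The reason this helps can be sketched in a few lines of Python. An analyzed field is indexed as several lowercase terms, while a not_analyzed field keeps the whole value as a single unchanged term, so only a document whose entire name equals the query matches on it. This is a toy model with invented helper names, not how Elasticsearch stores terms internally.

```python
# The "name" values from the question, keyed by document id.
DOCS = {1: "John Doe", 2: "My friend John Doe", 3: "John", 4: "Jack"}

def analyzed_terms(value):
    # Analyzed field: lowercased and split into separate terms.
    return value.lower().split()

def untouched_terms(value):
    # not_analyzed field: the whole value is one term, stored as-is.
    return [value]

def matches(query_term, term_fn):
    # Ids of the documents whose indexed terms contain the query term.
    return [doc_id for doc_id, name in DOCS.items() if query_term in term_fn(name)]

print(matches("john", analyzed_terms))
print(matches("John", untouched_terms))
```

Note that in this model the untouched field is case-sensitive, so a search for "john" finds nothing there; matching on the exact stored value is the price of leaving the field unanalyzed.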

If you need flexibility, you can boost at query time instead of at index time by using the '^' notation in a query_string query:

{
    "query_string" : {
        "fields" : ["name", "name.untouched^5"],
        "query" : "this AND that OR thus"
    }
}