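I need to do exact matching against a set of "short documents". For example:
Documents:
Expected results:
Can ES do this? How can I achieve it? I tried boosting "name", but I can't find how to match a document field exactly instead of searching within it.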
Answer 0 (score: 5)
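What you are describing is how a search engine works by default. A search for "John Doe" becomes a search for the terms "john" and "doe". For each term, it looks for documents that contain the term, then assigns a _score to each document based on:
how common the term is across all documents (more common means less relevant)
how common the term is within the document's field (more common means more relevant)
how long the document's field is (longer means less relevant)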
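As a rough illustration of how term frequency, document frequency, and field length interact, here is a minimal Python sketch. It assumes a simplified classic TF-IDF formula and a whitespace tokenizer; the function names are made up for this example, and the numbers will not match the _score values Elasticsearch returns, because the real scoring includes further factors.

```python
import math

# The "name" field of the four test documents used in this answer.
docs = {1: "John Doe", 2: "My friend John Doe", 3: "John", 4: "Jack"}

def tokens(text):
    # Simplified analysis: lowercase and split on whitespace.
    return text.lower().split()

def score(query, doc_id):
    field = tokens(docs[doc_id])
    total = 0.0
    for term in tokens(query):
        freq = field.count(term)
        if freq == 0:
            continue
        # Number of documents whose field contains the term.
        doc_freq = sum(1 for text in docs.values() if term in tokens(text))
        idf = 1.0 + math.log(len(docs) / (doc_freq + 1))  # rarer term => higher weight
        tf = math.sqrt(freq)                              # more occurrences => higher weight
        norm = 1.0 / math.sqrt(len(field))                # longer field => lower weight
        total += tf * idf * idf * norm
    return total

def rank(query):
    return sorted(docs, key=lambda doc_id: score(query, doc_id), reverse=True)

print(rank("john doe"))
print(rank("john"))
```

Under these assumptions the shortest field containing every query term ranks first, which is the behavior the question asks for.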
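The reason you are not seeing clear results is that Elasticsearch is distributed and you are testing with small amounts of data. By default an index has 5 primary shards, and your documents are indexed on different shards. Each shard keeps its own document frequency counts, so the scores are distorted.
When you add real amounts of data, the frequencies even out across shards, but to test with small amounts of data you need to do one of two things:
create the index with a single primary shard, or
specify search_type=dfs_query_then_fetch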
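The shard effect can be sketched in Python. This is only an illustration: the shard assignment below is hypothetical, the idf formula is a simplified classic one, and the names are invented for the example. It shows the same term receiving different weights depending on which shard's statistics are used.

```python
import math

# Analyzed "name" field of the four test documents.
docs = {1: "john doe", 2: "my friend john doe", 3: "john", 4: "jack"}

# Hypothetical placement of the documents on two shards.
shards = {"shard_a": [1, 2], "shard_b": [3, 4]}

def idf(doc_count, doc_freq):
    # Simplified classic inverse document frequency.
    return 1.0 + math.log(doc_count / (doc_freq + 1))

def doc_freq(term, doc_ids):
    # Number of the given documents that contain the term.
    return sum(1 for doc_id in doc_ids if term in docs[doc_id].split())

# Statistics computed over the whole index, as dfs_query_then_fetch would gather them.
global_idf = idf(len(docs), doc_freq("john", docs))

# Statistics computed separately on each shard, as happens by default.
local_idf = {
    name: idf(len(doc_ids), doc_freq("john", doc_ids))
    for name, doc_ids in shards.items()
}

print(global_idf)
print(local_idf)
```

With only a handful of documents the per-shard weights diverge noticeably; with realistic data volumes they converge toward the global value.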
To demonstrate, first index your data:
curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1' -d '
{
"alt" : "John W Doe",
"name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '
{
"alt" : "John A Doe",
"name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1' -d '
{
"alt" : "Susy",
"name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1' -d '
{
"alt" : "John Doe",
"name" : "Jack"
}
'
Now search for "john doe", remembering to specify dfs_query_then_fetch:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
"query" : {
"match" : {
"name" : "john doe"
}
}
}
'
Doc 1 is first in the results:
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 1.0189849,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.81518793,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 0.3066778,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# }
# ],
# "max_score" : 1.0189849,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 8
# }
Whereas when you search for just "john":
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
"query" : {
"match" : {
"name" : "john"
}
}
}
'
Doc 3 appears first:
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 1,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 0.625,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.5,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# }
# ],
# "max_score" : 1,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 5
# }
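The second problem is matching "John Doé". This is a question of analysis. To make full text more searchable, we analyze it into separate terms, or tokens, which are what actually gets stored in the index. So that a user searching for john also matches John, JOHN and john, each term/token is passed through a number of token filters to put it into a standard form.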
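When we do a full text search, the search terms go through exactly the same process. So if we have a document that contains John, it is indexed as john, and if the user searches for JOHN, we actually search for john.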
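The idea of running the same analysis at index time and at search time can be sketched with a toy inverted index in Python. The analyze, index_doc and search functions are stand-ins written for this example (whitespace tokenizer plus lowercasing), not Elasticsearch APIs.

```python
from collections import defaultdict

def analyze(text):
    # Stand-in analyzer: split on whitespace, then lowercase every token.
    return [token.lower() for token in text.split()]

# Inverted index: term -> set of ids of the documents containing it.
index = defaultdict(set)

def index_doc(doc_id, text):
    # Only the analyzed terms are stored, never the original casing.
    for term in analyze(text):
        index[term].add(doc_id)

def search(query):
    # The query goes through the same analyzer before the lookup.
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

index_doc(1, "John Doe")
index_doc(3, "John")
index_doc(4, "Jack")

print(sorted(index))
print(search("JOHN"))
```

Because both sides are normalized the same way, a query in any casing finds documents 1 and 3.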
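To make Doé match doe, we need a token filter that removes accents, and we need to apply it both to the text being indexed and to the search terms. The easiest way to do this is with the ASCII folding token filter.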
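We can define a custom analyzer when we create the index, and in the mapping we can specify that a particular field should use that analyzer at both index time and search time.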
First, delete the old index:
curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'
Then create the index, specifying the custom analyzer and the mapping:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"no_accents" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"type" : "custom",
"tokenizer" : "standard"
}
}
}
},
"mappings" : {
"test" : {
"properties" : {
"name" : {
"type" : "string",
"analyzer" : "no_accents"
}
}
}
}
}
'
Reindex the data:
curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1' -d '
{
"alt" : "John W Doe",
"name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '
{
"alt" : "John A Doe",
"name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1' -d '
{
"alt" : "Susy",
"name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1' -d '
{
"alt" : "John Doe",
"name" : "Jack"
}
'
Now test the search:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
"query" : {
"match" : {
"name" : "john doé"
}
}
}
'
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 1.0189849,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.81518793,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 0.3066778,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# }
# ],
# "max_score" : 1.0189849,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 6
# }
Answer 1 (score: 2)
I think you will get what you need if you map the field as a multi_field and boost the not-analyzed sub-field:
"name": {
"type": "multi_field",
"fields": {
"untouched": {
"type": "string",
"index": "not_analyzed",
"boost": "1.1"
},
"name": {
"include_in_all": true,
"type": "string",
"index": "analyzed",
"search_analyzer": "someanalyzer",
"index_analyzer": "someanalyzer"
}
}
}
If you need flexibility, you can boost at query time rather than at index time by using the '^' notation in a query_string query:
{
"query_string" : {
"fields" : ["name", "name.untouched^5"],
"query" : "this AND that OR thus"
}
}
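When the request body is built programmatically, each entry in "fields" has to be its own string, with the boost appended after '^'. A minimal Python sketch follows; the boosted_query_string helper and the boost values are illustrative assumptions, not part of any client library.

```python
import json

def boosted_query_string(query, boosts):
    # boosts maps a field name to its boost; a boost of 1 is left implicit.
    fields = [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in boosts.items()
    ]
    return {"query": {"query_string": {"fields": fields, "query": query}}}

body = boosted_query_string("john doe", {"name": 1, "name.untouched": 5})

# The serialized body can be sent with curl -d or any HTTP client.
print(json.dumps(body, indent=2))
```

Keeping the boosts in a dictionary makes it easy to tune them per request without reindexing.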