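I need to find a phrase in documents, looking in both the title and the content. The title is more important than the content, so I would like documents that match in both title and content to come first, then documents that match only in the title, then documents that match only in the content.
This seems like a very basic thing.
So I created the index and data like this: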
PUT /test_index
PUT /test_index/article/3263
{
"id": 3263,
"pagetitle": "Lösungen",
"searchable_content": "abc"
}
PUT /test_index/article/1005
{
"id": 1005,
"pagetitle": "Lösungen",
"searchable_content": "test! Lösungen test?"
}
PUT /test_index/article/677
{
"id": 677,
"pagetitle": "Lösungen",
"searchable_content": "test Lösungen test!"
}
PUT /test_index/article/666
{
"id": 666,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc"
}
I run a query like this:
GET /test_index/_search
{
"query": {
"bool": {
"must": [{
"multi_match": {
"query": "Lösungen",
"fields": ["pagetitle^2", "searchable_content"]
}
}
]
}
},
"highlight": {
"fields": {
"pagetitle": {},
"searchable_content": {}
}
}
}
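But the results are not what I expected. A document that matches only in the title is ranked above documents that match in both title and content, as shown here: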
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.5753642,
"hits": [
{
"_index": "test_index",
"_type": "article",
"_id": "3263",
"_score": 0.5753642,
"_source": {
"id": 3263,
"pagetitle": "Lösungen",
"searchable_content": "abc"
},
"highlight": {
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "1005",
"_score": 0.36464313,
"_source": {
"id": 1005,
"pagetitle": "Lösungen",
"searchable_content": "test! Lösungen test?"
},
"highlight": {
"searchable_content": [
"test! <em>Lösungen</em> test?"
],
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "677",
"_score": 0.36464313,
"_source": {
"id": 677,
"pagetitle": "Lösungen",
"searchable_content": "test Lösungen test!"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test!"
],
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "666",
"_score": 0.2876821,
"_source": {
"id": 666,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test abc"
]
}
}
]
}
}
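What I tried next was to adjust the field boosts further. For the data above, what seems to work is boosting both fields and using most_fields as the type, like this: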
GET /test_index/_search
{
"query": {
"bool": {
"must": [{
"multi_match": {
"query": "Lösungen",
"fields": ["pagetitle^3", "searchable_content^2"],
"type": "most_fields"
}
}
]
}
},
"highlight": {
"fields": {
"pagetitle": {},
"searchable_content": {}
}
}
}
This gives the expected results for this set of data.
But if I add 2 extra records:
PUT /test_index/article/999
{
"id": 999,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc double match Lösungen"
}
PUT /test_index/article/1006
{
"id": 1006,
"pagetitle": "Lösungen and Lösungen",
"searchable_content": "test sample"
}
it no longer works, because the results now look like this:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 2.2315955,
"hits": [
{
"_index": "test_index",
"_type": "article",
"_id": "1006",
"_score": 2.2315955,
"_source": {
"id": 1006,
"pagetitle": "Lösungen and Lösungen",
"searchable_content": "test sample"
},
"highlight": {
"pagetitle": [
"<em>Lösungen</em> and <em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "666",
"_score": 1.219939,
"_source": {
"id": 666,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test abc"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "1005",
"_score": 0.86785066,
"_source": {
"id": 1005,
"pagetitle": "Lösungen",
"searchable_content": "test! Lösungen test?"
},
"highlight": {
"searchable_content": [
"test! <em>Lösungen</em> test?"
],
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "677",
"_score": 0.86785066,
"_source": {
"id": 677,
"pagetitle": "Lösungen",
"searchable_content": "test Lösungen test!"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test!"
],
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "3263",
"_score": 0.8630463,
"_source": {
"id": 3263,
"pagetitle": "Lösungen",
"searchable_content": "abc"
},
"highlight": {
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "999",
"_score": 0.7876096,
"_source": {
"id": 999,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc double match Lösungen"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test abc double match <em>Lösungen</em>"
]
}
}
]
}
}
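As you can see, a record that matches only in the content (666) is now ranked above records that match in both title and content.
Could you explain what I am doing wrong here and how to fix it?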
Answer 0 (score: 1)
Try a constant score query like this:
GET test_index/_search
{
"query": {
"bool": {
"should": [
{
"constant_score": {
"query": {
"match": {
"pagetitle": {
"query": "Lösungen"
}
}
},
"boost": 2
}
},
{
"constant_score": {
"query": {
"match": {
"searchable_content": "Lösungen"
}
}
}
}
]
}
},
"highlight": {
"fields": {
"pagetitle": {},
"searchable_content": {}
}
}
}
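According to the documentation, a constant score query "...wraps another query and simply returns a constant score equal to the query boost for every document in the filter." (ref)
The link from @davide can help you understand why even a match on searchable_content alone can give a document a higher score. Since you want to ignore term frequency and IDF across fields, you can use a constant score for the match on each field.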
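To make the effect of the constant scores concrete, here is a minimal Python sketch of how the two should clauses add up. It does not talk to Elasticsearch: the function name constant_score_rank is hypothetical, and a plain substring test stands in for the analyzed match. The documents are the six from the question.

```python
# Toy model of the bool/should query above: each matching constant_score
# clause contributes its boost, and bool adds the contributions together.
# Assumption: a case-insensitive substring test replaces the real analyzer.

TITLE_BOOST = 2.0    # "boost": 2 on the pagetitle clause
CONTENT_BOOST = 1.0  # default boost on the searchable_content clause


def constant_score_rank(docs, term):
    """Return (doc id, score) pairs sorted by descending score."""
    needle = term.lower()
    scored = []
    for doc in docs:
        score = 0.0
        if needle in doc["pagetitle"].lower():
            score += TITLE_BOOST
        if needle in doc["searchable_content"].lower():
            score += CONTENT_BOOST
        if score > 0:
            scored.append((doc["id"], score))
    # sorted() is stable, so documents with equal scores keep input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = [
    {"id": 3263, "pagetitle": "Lösungen", "searchable_content": "abc"},
    {"id": 1005, "pagetitle": "Lösungen", "searchable_content": "test! Lösungen test?"},
    {"id": 677, "pagetitle": "Lösungen", "searchable_content": "test Lösungen test!"},
    {"id": 666, "pagetitle": "abc", "searchable_content": "test Lösungen test abc"},
    {"id": 999, "pagetitle": "abc",
     "searchable_content": "test Lösungen test abc double match Lösungen"},
    {"id": 1006, "pagetitle": "Lösungen and Lösungen", "searchable_content": "test sample"},
]

ranking = constant_score_rank(docs, "Lösungen")
for doc_id, score in ranking:
    print(doc_id, score)
```

In this model only three scores are possible: 3 for a match in both fields, 2 for a title-only match and 1 for a content-only match. How often the term occurs, and how rare it is, no longer influences the order.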
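By the rules listed in the original question, the query above works perfectly. But according to the OP's comment, results also need to be ranked by how often the search term occurs. So term frequency and inverse document frequency do matter, but perhaps field length does not (if we only want to order by the number of occurrences). In that case, I suggest you set up the index like this: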
POST test_index_v1
{
"mappings": {
"article": {
"properties": {
"id": {
"type": "long"
},
"pagetitle": {
"type": "string",
"norms": {
"enabled": false
}
},
"searchable_content": {
"type": "string",
"norms": {
"enabled": false
}
}
}
}
}
}
Note: type: string was replaced by type: text in version 5 and later.
The link mentioned by @davide describes what disabling norms does.
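As a rough illustration of why disabling norms helps here, the Python sketch below compares a simplified BM25 term-frequency factor with and without field-length normalization. It assumes the BM25 similarity with its usual defaults (k1 = 1.2, b = 0.75), and it assumes that a field without norms is scored as if it had average length. The function names and the example lengths are made up, and Lucene's lossy encoding of field lengths is ignored. Older versions that use the classic TF-IDF similarity apply a different formula, but the idea is the same.

```python
# Simplified BM25 term-frequency factor, with and without length norms.
# Assumptions: default k1 and b, exact field lengths, hypothetical inputs.

K1 = 1.2
B = 0.75


def tf_with_norms(freq, field_len, avg_len):
    """Term-frequency factor when field-length norms are enabled."""
    norm = K1 * (1 - B + B * field_len / avg_len)
    return freq * (K1 + 1) / (freq + norm)


def tf_without_norms(freq):
    """Same factor when every field is treated as average length."""
    return freq * (K1 + 1) / (freq + K1)


# A short field with one occurrence versus a long field with two.
short_once = tf_with_norms(freq=1, field_len=1, avg_len=4)
long_twice = tf_with_norms(freq=2, field_len=8, avg_len=4)
print("with norms:   ", short_once, long_twice)

# Without norms only the number of occurrences matters.
print("without norms:", tf_without_norms(1), tf_without_norms(2))
```

With norms enabled, the single occurrence in the short field gets the larger factor in this example. Without norms, two occurrences always beat one, which is the behaviour wanted when ranking by number of occurrences.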
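Second, when you run queries against a small number of documents, and assuming the index has more than one shard, it is better to run the query with search_type=dfs_query_then_fetch, because the local IDF on each shard can differ a lot. (Read this.)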
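The following Python sketch shows how much a shard-local IDF can vary on a tiny index. It uses the IDF formula of Lucene's BM25 similarity. The way the six documents are spread over shards is purely hypothetical, and the name bm25_idf is made up for this example.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF as computed by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical layout: shard name -> (documents on the shard,
#                                     documents on the shard containing the term)
shards = {"shard_0": (1, 1), "shard_1": (3, 2), "shard_2": (2, 1)}

# Default search type: each shard scores with its own statistics.
local_idf = {name: bm25_idf(n, df) for name, (n, df) in shards.items()}
for name, value in local_idf.items():
    print(name, round(value, 7))

# dfs_query_then_fetch: statistics are collected from all shards first,
# so every shard scores with the same index-wide value.
total_docs = sum(n for n, _ in shards.values())
total_freq = sum(df for _, df in shards.values())
global_idf = bm25_idf(total_docs, total_freq)
print("global", round(global_idf, 7))
```

A shard holding a single matching document yields ln(4/3) ≈ 0.2876821, which is the score reported for document 666 in the question's first result. That is consistent with the document sitting alone on its shard, although the response itself does not confirm this.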
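Third, building on the last query, what we want is to give TF-IDF some weight as well. The last query ranks documents exactly the same whether the search term occurs 2 or 3 times in the same field. We can add a bool-should block whose score is added to the scores of the constant score blocks, like this: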
GET test_index_v1/_search?search_type=dfs_query_then_fetch
{
"query": {
"bool": {
"should": [
{
"constant_score": {
"query": {
"match": {
"pagetitle": {
"query": "Lösungen"
}
}
},
"boost": 2
}
},
{
"constant_score": {
"query": {
"match": {
"searchable_content": "Lösungen"
}
}
}
},
{
"bool": {
"should": [
{
"match": {
"pagetitle": {
"query": "Lösungen",
"boost": 2
}
}
},
{
"match": {
"searchable_content": "Lösungen"
}
}
]
}
}
]
}
},
"highlight": {
"fields": {
"pagetitle": {},
"searchable_content": {}
}
}
}
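The way the three should clauses of this query combine can be sketched in Python as follows. This is a toy model, not Elasticsearch's scoring code: it assumes norms are disabled, uses one hypothetical IDF value for both fields, and the names saturating_tf and combined_score are invented for illustration.

```python
# Toy model of the combined query: constant tier scores plus a
# frequency-sensitive relevance part, all added together by bool/should.
# Assumptions: norms disabled, the same hypothetical IDF for both fields.

K1 = 1.2


def saturating_tf(freq):
    """BM25-style term-frequency factor for a field of average length."""
    return freq * (K1 + 1) / (freq + K1) if freq else 0.0


def combined_score(title_freq, content_freq, idf=1.0):
    # Constant part: 2 for a title match, 1 for a content match.
    tier = (2.0 if title_freq else 0.0) + (1.0 if content_freq else 0.0)
    # Relevance part: the nested bool, with the title clause boosted by 2.
    relevance = 2.0 * idf * saturating_tf(title_freq) + idf * saturating_tf(content_freq)
    return tier + relevance


# (occurrences in pagetitle, occurrences in searchable_content)
for freqs in [(1, 1), (2, 0), (1, 0), (0, 2), (0, 1)]:
    print(freqs, round(combined_score(*freqs), 4))
```

Within one tier the document with more occurrences now scores higher, while in this example the tiers still come out in the intended order. With real IDF values the relevance part can grow large enough to cross tier boundaries, so the ordering should be checked against actual data.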
Answer 1 (score: 1)
Thanks to @ArchitSaxana, it now seems to work fine. In case someone needs something similar, here is the full example (including fuzziness):
PUT test_index_v1
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "ngram_filter"]
}
}
}
},
"mappings": {
"doc": {
"_all": {
"type": "text",
"analyzer": "ngram_analyzer",
"search_analyzer": "standard"
},
"properties": {
"pagetitle": {
"type": "text",
"include_in_all": true,
"term_vector": "yes",
"analyzer": "ngram_analyzer",
"search_analyzer": "standard",
"norms": false
},
"searchable_content": {
"type": "text",
"include_in_all": true,
"term_vector": "yes",
"analyzer": "ngram_analyzer",
"search_analyzer": "standard",
"norms": false
}
}
}
}
}
PUT /test_index_v1/article/1006
{
"id": 1006,
"pagetitle": "Lösungen Lösungen",
"searchable_content": "test"
}
PUT /test_index_v1/article/3263
{
"id": 3263,
"pagetitle": "Lösungen",
"searchable_content": "test"
}
PUT /test_index_v1/article/1005
{
"id": 1005,
"pagetitle": "Lösungen",
"searchable_content": "test! Lösungen test?"
}
PUT /test_index_v1/article/677
{
"id": 677,
"pagetitle": "Lösungen",
"searchable_content": "test Lösungen test!"
}
PUT /test_index_v1/article/666
{
"id": 666,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc"
}
PUT /test_index_v1/article/999
{
"id": 999,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc double match Lösungen"
}
PUT /test_index_v1/article/18000
{
"id": 18000,
"pagetitle": "abc Lösungen and Lösungen",
"searchable_content": "test Lösungen test abc double match Lösungen"
}
PUT /test_index_v1/article/18001
{
"id": 18001,
"pagetitle": "abc Lösungen ",
"searchable_content": "test Lösungen test abc double match Lösungen"
}
PUT /test_index_v1/article/18001
{
"id": 18001,
"pagetitle": "abc Lupungen ",
"searchable_content": "test Lupungen test abc double match Lupungen"
}
GET test_index_v1/_search?search_type=dfs_query_then_fetch
{
"query": {
"bool": {
"should": [
{
"constant_score": {
"query": {
"match": {
"pagetitle": {
"query": "Lupungen",
"fuzziness": "AUTO"
}
}
},
"boost": 2
}
},
{
"constant_score": {
"query": {
"match": {
"searchable_content": {
"query": "Lupungen",
"fuzziness": "AUTO"
}
}
}
}
},
{
"bool": {
"should": [
{
"match": {
"pagetitle": {
"query": "Lupungen",
"fuzziness": "AUTO" ,
"boost": 2
}
}
},
{
"match": {
"searchable_content": {
"query": "Lupungen",
"fuzziness": "AUTO"
}
}
}
]
}
}
]
}
},
"highlight": {
"fields": {
"pagetitle": {},
"searchable_content": {}
}
}
}
The result is:
{
"took": 27,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 30.744686,
"hits": [
{
"_index": "test_index_v1",
"_type": "article",
"_id": "18001",
"_score": 30.744686,
"_source": {
"id": 18001,
"pagetitle": "abc Lupungen ",
"searchable_content": "test Lupungen test abc double match Lupungen"
},
"highlight": {
"searchable_content": [
"test <em>Lupungen</em> test abc double match <em>Lupungen</em>"
],
"pagetitle": [
"abc <em>Lupungen</em> "
]
}
},
{
"_index": "test_index_v1",
"_type": "article",
"_id": "18000",
"_score": 4.4021354,
"_source": {
"id": 18000,
"pagetitle": "abc Lösungen and Lösungen",
"searchable_content": "test Lösungen test abc double match Lösungen"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test abc double match <em>Lösungen</em>"
],
"pagetitle": [
"abc <em>Lösungen</em> and <em>Lösungen</em>"
]
}
},
{
"_index": "test_index_v1",
"_type": "article",
"_id": "1005",
"_score": 4.019735,
"_source": {
"id": 1005,
"pagetitle": "Lösungen",
"searchable_content": "test! Lösungen test?"
},
"highlight": {
"searchable_content": [
"test! <em>Lösungen</em> test?"
],
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index_v1",
"_type": "article",
"_id": "677",
"_score": 4.019735,
"_source": {
"id": 677,
"pagetitle": "Lösungen",
"searchable_content": "test Lösungen test!"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test!"
],
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index_v1",
"_type": "article",
"_id": "1006",
"_score": 3.0157328,
"_source": {
"id": 1006,
"pagetitle": "Lösungen Lösungen",
"searchable_content": "test"
},
"highlight": {
"pagetitle": [
"<em>Lösungen</em> <em>Lösungen</em>"
]
}
},
{
"_index": "test_index_v1",
"_type": "article",
"_id": "3263",
"_score": 2.7387147,
"_source": {
"id": 3263,
"pagetitle": "Lösungen",
"searchable_content": "test"
},
"highlight": {
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index_v1",
"_type": "article",
"_id": "999",
"_score": 1.3864026,
"_source": {
"id": 999,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc double match Lösungen"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test abc double match <em>Lösungen</em>"
]
}
},
{
"_index": "test_index_v1",
"_type": "article",
"_id": "666",
"_score": 1.2810202,
"_source": {
"id": 666,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test abc"
]
}
}
]
}
}
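These are the expected results. First comes the exact match for the query, then the fuzzy results that match in both title and searchable_content (ordered correctly), then the fuzzy results that match only in the title (ordered correctly), and finally the fuzzy results that match only in searchable_content (ordered correctly).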