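I am trying to enable full-text search on tags (keyword phrases) that I have created and that can be assigned to documents in an index (named `mdelta` in the code below).
My results are (1) not what I expected and (2) inconsistent when I run the same code repeatedly.
Some code follows. I have simplified the mapping and documents to keep the code clear and to make sure the problem does not lie in some other part of the documents or mapping. I am running all of this in the Kibana Dev Tools console.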
PUT /mdelta
{
"mappings":{
"tags":{
"properties":{
"synonyms":{
"type":"text"
}
}
}
}
}
POST _bulk
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Fe"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron Deficiency"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Serum Iron"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron Sulfate"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron Deficiency Anemia"}
GET mdelta/tags/_search
{
"explain":false,
"query": {
"match" : {
"synonyms" : "iron"
}
}
}
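Based on my understanding of the scoring algorithm, I expected the document {"synonyms":"Iron"} to be returned first
(highest score). That is not the case. The results: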
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.5377023,
"hits": [
{
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj9",
"_score": 0.5377023,
"_source": {
"synonyms": "Iron Sulfate"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj5",
"_score": 0.2876821,
"_source": {
"synonyms": "Iron"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj8",
"_score": 0.25811607,
"_source": {
"synonyms": "Serum Iron"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj7",
"_score": 0.1805489,
"_source": {
"synonyms": "Iron Deficiency"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj-",
"_score": 0.14638957,
"_source": {
"synonyms": "Iron Deficiency Anemia"
}
}
]
}
}
I repeated the query with "explain" set to true.
{
"took": 38,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.5377023,
"hits": [
{
"_shard": "[mdelta][4]",
"_node": "McQ619KqR0akS1mHvTXjDw",
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj9",
"_score": 0.5377023,
"_source": {
"synonyms": "Iron Sulfate"
},
"_explanation": {
"value": 0.5377023,
"description": "weight(synonyms:iron in 1) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.5377023,
"description": "score(doc=1,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.6931472,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 2,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.7757405,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 1.5,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[mdelta][2]",
"_node": "McQ619KqR0akS1mHvTXjDw",
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj5",
"_score": 0.2876821,
"_source": {
"synonyms": "Iron"
},
"_explanation": {
"value": 0.2876821,
"description": "weight(synonyms:iron in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.2876821,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
{
"value": 1,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 1,
"description": "avgFieldLength",
"details": []
},
{
"value": 1,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[mdelta][3]",
"_node": "McQ619KqR0akS1mHvTXjDw",
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj8",
"_score": 0.25811607,
"_source": {
"synonyms": "Serum Iron"
},
"_explanation": {
"value": 0.25811607,
"description": "weight(synonyms:iron in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.25811607,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.89722675,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 2,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[mdelta][1]",
"_node": "McQ619KqR0akS1mHvTXjDw",
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj7",
"_score": 0.1805489,
"_source": {
"synonyms": "Iron Deficiency"
},
"_explanation": {
"value": 0.1805489,
"description": "weight(synonyms:iron in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.1805489,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.18232156,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 2,
"description": "docFreq",
"details": []
},
{
"value": 2,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.9902773,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 2.5,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[mdelta][1]",
"_node": "McQ619KqR0akS1mHvTXjDw",
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj-",
"_score": 0.14638957,
"_source": {
"synonyms": "Iron Deficiency Anemia"
},
"_explanation": {
"value": 0.14638956,
"description": "weight(synonyms:iron in 1) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.14638956,
"description": "score(doc=1,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.18232156,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 2,
"description": "docFreq",
"details": []
},
{
"value": 2,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.8029196,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 2.5,
"description": "avgFieldLength",
"details": []
},
{
"value": 4,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
}
]
}
}
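If you look at the first hit ("Iron Sulfate"), docFreq appears to be 1 and docCount appears to be 2. That does not look right.
Also, if I run DELETE /mdelta
and then rerun my code, I can get the results in a different order, for example: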
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.2876821,
"hits": [
{
"_index": "mdelta",
"_type": "tags",
"_id": "Qd0JQWABt4cFDxBHv7Fe",
"_score": 0.2876821,
"_source": {
"synonyms": "Serum Iron"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "Pt0JQWABt4cFDxBHv7Fe",
"_score": 0.2876821,
"_source": {
"synonyms": "Iron"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "QN0JQWABt4cFDxBHv7Fe",
"_score": 0.2876821,
"_source": {
"synonyms": "Iron Deficiency"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "Qt0JQWABt4cFDxBHv7Fe",
"_score": 0.19856805,
"_source": {
"synonyms": "Iron Sulfate"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "Q90JQWABt4cFDxBHv7Fe",
"_score": 0.16853254,
"_source": {
"synonyms": "Iron Deficiency Anemia"
}
}
]
}
}
Any ideas about what I am doing wrong would be much appreciated.
Answer 0 (score: 3)
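The reason you do not get consistent results after reindexing is that term frequencies are computed per shard. Each time you reindex, the documents are distributed across the shards differently than before, because you did not specify any routing and the auto-generated document IDs change on every run.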
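To make the per-shard effect concrete, here is a minimal sketch that plugs the numbers from the question's explain output into the two formulas shown there. The function names are my own, not part of any Elasticsearch API. The statistics for shard 4 (docCount 2, avgFieldLength 1.5) suggest that "Iron Sulfate" presumably shares its shard with the one-term document "Fe", which is an inference from the output rather than something the output states.

```python
import math

K1 = 1.2   # "parameter k1" from the explain output
B = 0.75   # "parameter b" from the explain output


def idf(doc_freq, doc_count):
    # idf formula as printed in the explain output
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, field_length, avg_field_length, k1=K1, b=B):
    # tfNorm formula as printed in the explain output
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


def bm25(freq, doc_freq, doc_count, field_length, avg_field_length):
    # Score of a single-term query: idf * tfNorm
    return idf(doc_freq, doc_count) * tf_norm(freq, field_length, avg_field_length)


# Shard 4: "Iron Sulfate" (docFreq=1, docCount=2, fieldLength=2.56, avg=1.5)
iron_sulfate = bm25(1, 1, 2, 2.56, 1.5)
# Shard 2: "Iron" is alone on its shard (docFreq=1, docCount=1, lengths=1)
iron = bm25(1, 1, 1, 1, 1)

print(f"Iron Sulfate: {iron_sulfate:.7f}")
print(f"Iron:         {iron:.7f}")
```

On shard 4 the term "iron" appears in one of two documents, so it looks rarer there and gets a higher idf than on a shard where every document contains it. That alone is enough to push "Iron Sulfate" above "Iron".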
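As for not getting the ranking you expect from Elasticsearch:
this is probably because the index contains very few documents. Try running the query with the search_type parameter, like this:
GET mdelta/tags/_search?search_type=dfs_query_then_fetch
This ensures that index-level term frequencies are computed first, before any document is scored.
You can use this in development, but I would not recommend it in production. With enough data, term frequencies should be roughly the same across shards.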
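As a rough illustration of what index-level statistics change, the sketch below scores the six documents from the question as if they all shared a single set of statistics. It is a simplification under stated assumptions: it splits on whitespace and lowercases instead of running the real analyzer, and it uses exact field lengths, whereas Lucene stores a lossy encoding (which is why the explain output shows 2.56 for a two-term field). `score_all` is a hypothetical helper, not an Elasticsearch function, so the absolute scores will not match what Elasticsearch returns.

```python
import math

# The six tag documents indexed in the question
docs = [
    "Iron",
    "Fe",
    "Iron Deficiency",
    "Serum Iron",
    "Iron Sulfate",
    "Iron Deficiency Anemia",
]


def score_all(term, docs, k1=1.2, b=0.75):
    """Score every matching document using statistics over ALL documents."""
    tokens = [d.lower().split() for d in docs]  # crude stand-in for the analyzer
    doc_count = len(tokens)
    doc_freq = sum(1 for t in tokens if term in t)
    avg_len = sum(len(t) for t in tokens) / doc_count
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

    scores = {}
    for doc, t in zip(docs, tokens):
        freq = t.count(term)
        if freq == 0:
            continue  # document does not match the query term
        tf_norm = (freq * (k1 + 1)) / (
            freq + k1 * (1 - b + b * len(t) / avg_len)
        )
        scores[doc] = idf * tf_norm
    # Highest score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = score_all("iron", docs)
for doc, score in ranking:
    print(f"{score:.4f}  {doc}")
```

With shared statistics every document gets the same idf, so only field length separates them: the one-term tag "Iron" comes first, the two-term tags tie, and "Iron Deficiency Anemia" comes last, which is the ordering the question expected.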
See: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html