How to highlight nested fields in Elasticsearch

Date: 2015-01-30 19:59:20

Tags: elasticsearch lucene nested nested-query

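Despite the way Lucene structures nested documents internally, I am trying to get my nested fields highlighted whenever a search matches their content.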

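Here is the explanation from the Elasticsearch documentation (mapping, nested type):

Internal implementation

Internally, nested objects are indexed as additional documents, but since they are guaranteed to be indexed within the same "block", they can be joined with the parent document extremely quickly.

Those internal nested documents are automatically masked away when operating on the index (for example, when searching with a match_all query), and they bubble out when a nested query is used.

Because nested documents are always masked to the parent document, they can never be accessed outside the scope of the nested query. For example, stored fields can be enabled on fields inside nested objects, but there is no way to retrieve them, since stored fields are fetched outside the scope of the nested query.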

0. In my case

I have an Elasticsearch index with a mapping like the following:

{
    "my_documents": {
        "dynamic_date_formats": [
            "dd.MM.yyyy",
            "yyyy-MM-dd",
            "yyyy-MM-dd HH:mm:ss"
        ],
        "index_analyzer": "Analyzer2_index",
        "search_analyzer": "Analyzer2_search_decompound",
        "_timestamp": {
            "enabled": true
        },
        "properties": {
            "identifier": {
                "type": "string"
            },
            "description": {
                "type": "multi_field",
                "fields": {
                    "sort": {
                        "type": "string",
                        "index": "not_analyzed"
                    },
                    "description": {
                        "type": "string"
                    }
                }
            },
            "files": {
                "type": "nested",
                "include_in_root": true,
                "properties": {
                    "content": {
                        "type": "string",
                        "include_in_root": true
                    }
                }
            },
            "and then some other": "normal string fields"
        }
    }
}

I am trying to run a query like this:

{
    "size": 100,
    "query": {
        "bool": {
            "should": [
                {
                    "nested": {
                        "path": "files",
                        "query": {
                            "bool": {
                                "should": {
                                    "match": {
                                        "content": {
                                            "query": "burpcontrol",
                                            "minimum_should_match": "85%"
                                        }
                                    }
                                }
                            }
                        }
                    }
                },
                {
                    "match": {
                        "description": {
                            "query": "burpcontrol",
                            "minimum_should_match": "85%"
                        }
                    }
                },
                {
                    "match": {
                        "identifier": {
                            "query": "burpcontrol",
                            "minimum_should_match": "85%"
                        }
                    }
                }
            ]
        }
    },
    "highlight": {
        "pre_tags": [
            "<span style=\"background-color: yellow\">"
        ],
        "post_tags": [
            "</span>"
        ],
        "order": "score",
        "no_match_size": 100,
        "fragment_size": 50,
        "number_of_fragments": 3,
        "require_field_match": true,
        "fields": {
            "files.content": {},
            "description": {},
            "identifier": {}
        }
    }
}

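The problems I am running into are:

1. require_field_match

If I use "require_field_match": false, the search term is highlighted in ALL fields, even though highlighting does not work on nested fields. This is the solution I am actually using, but the performance is awful. For 50 documents my query takes 25 seconds; for 100 documents, about 50 seconds; for 10 documents, 5 seconds. If I remove the nested field from the highlighting, everything runs fast!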

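2. include_in_root

I would like a flattened version of the nested fields (so that they are also stored as normal objects/fields). To do that, I should specify:

"files": {"type": "nested", "include_in_root": true, ...

But I don't know why, after reindexing, I can't see any additional flattened fields in the document root (I was expecting something like "files.content": ["content1", "content2", "..."]).

If that worked, it would be possible to access the content of the nested fields (in the flattened field) and run highlighting on it.

Do you know whether it is possible to get good (and performant) highlighting on nested fields, or can you at least suggest why my query is so slow? (I have already optimized the fragments.)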

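1 answer:

Answer 0 (score: 6)

There are a number of things you could do here with a parent/child relationship. I'll go through a few of them, and hopefully that will point you in the right direction; it will still take a lot of testing to find out whether this approach performs better for you. I also left out some details of your setup for clarity. Please forgive the long post.

I set up a parent/child mapping as follows: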

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
   },
   "mappings": {
      "parent_doc": {
         "properties": {
            "identifier": {
               "type": "string"
            },
            "description": {
               "type": "string"
            }
         }
      },
      "child_doc": {
         "_parent": {
            "type": "parent_doc"
         },
         "properties": {
            "content": {
               "type": "string"
            }
         }
      }
   }
}

Then I added some test documents:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"parent_doc","_id":1}}
{"identifier": "first", "description":"some special text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is special"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is not"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":2}}
{"identifier": "second", "description":"some different text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":2}}
{"content":"different child text, but special"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":3}}
{"identifier": "third", "description":"we don't want this parent"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":3}}
{"content":"or this child"}
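The bulk body above is newline-delimited JSON: every document takes an action line followed by a source line, and the body ends with a newline. A minimal Python sketch that produces such a body might look like the following. The function name `bulk_body` and the `(type, metadata, source)` tuple layout are hypothetical choices made for illustration, and actually sending the body to Elasticsearch is left out.

```python
import json


def bulk_body(index, docs):
    """Build a newline-delimited bulk body from (doc_type, meta, source) tuples.

    `meta` carries per-document action metadata such as "_id" or "_parent".
    """
    lines = []
    for doc_type, meta, source in docs:
        action = {"_index": index, "_type": doc_type}
        action.update(meta)
        lines.append(json.dumps({"index": action}))  # action line
        lines.append(json.dumps(source))             # source line
    # The bulk body has to end with a trailing newline.
    return "\n".join(lines) + "\n"


body = bulk_body("test_index", [
    ("parent_doc", {"_id": 1}, {"identifier": "first", "description": "some special text"}),
    ("child_doc", {"_parent": 1}, {"content": "text that is special"}),
])
print(body)
```

Generating the body this way keeps the action and source lines paired, which is easy to get wrong when the lines are written by hand.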

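If I understand your specification correctly, we want a query for "special" to return every document except the last two (correct me if I'm wrong). We need the documents that match the text, that have a child matching the text, or that have a parent matching the text.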
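To make the intended result set concrete, here is a small pure-Python model of the rule "matches, has a matching child, or has a matching parent" applied to the test documents. It is only a sketch: the ids (`p1`, `c1a`, and so on) are made-up labels, and `matches` is a crude whole-word stand-in for a real match query, not what Elasticsearch's analyzers do.

```python
# Toy model of the test data: (doc_id, parent_id or None, text).
DOCS = [
    ("p1", None, "some special text"),
    ("c1a", "p1", "text that is special"),
    ("c1b", "p1", "text that is not"),
    ("p2", None, "some different text"),
    ("c2", "p2", "different child text, but special"),
    ("p3", None, "we don't want this parent"),
    ("c3", "p3", "or this child"),
]


def matches(text, term):
    """Crude stand-in for a match query: whole-word, case-insensitive."""
    words = text.lower().replace(",", " ").split()
    return term.lower() in words


def expected_ids(docs, term):
    """Ids of docs that match, have a matching child, or have a matching parent."""
    direct = {doc_id for doc_id, _, text in docs if matches(text, term)}
    parent_of = {doc_id: parent for doc_id, parent, _ in docs if parent is not None}
    # Parents with at least one directly matching child.
    via_child = {parent_of[doc_id] for doc_id in direct if doc_id in parent_of}
    # Children whose parent matches directly.
    via_parent = {doc_id for doc_id, parent in parent_of.items() if parent in direct}
    return direct | via_child | via_parent


print(sorted(expected_ids(DOCS, "special")))
```

Under these assumptions the rule selects five of the seven documents and leaves out only the third parent and its child.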

We can get back the parents that match the query like this:

POST /test_index/parent_doc/_search
{
    "query": {
        "match": {
           "description": "special"
        }
    },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {}
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1.1263815,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         }
      ]
   }
}

We can get back the children that match the query like this:

POST /test_index/child_doc/_search
{
    "query": {
        "match": {
           "content": "special"
        }
    },
    "highlight": {
        "fields": {
            "content": {}
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.92364895,
      "hits": [
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.92364895,
            "_source": {
               "content": "text that is special"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.80819285,
            "_source": {
               "content": "different child text, but special"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         }
      ]
   }
}

We can get back both the parents that match the text and the children that match the text like this:

POST /test_index/parent_doc,child_doc/_search
{
    "query": {
        "multi_match": {
           "query": "special",
           "fields": ["description", "content"]
        }
    },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {},
            "content": {}
        }
    }
}
...
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1.1263815,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.75740534,
            "_source": {
               "content": "text that is special"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.6627297,
            "_source": {
               "content": "different child text, but special"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         }
      ]
   }
}

However, to get back all the documents related to this query, we need to use a bool query:

POST /test_index/parent_doc,child_doc/_search
{
   "query": {
      "bool": {
         "should": [
            {
               "multi_match": {
                  "query": "special",
                  "fields": [
                     "description",
                     "content"
                  ]
               }
            },
            {
               "has_child": {
                  "type": "child_doc",
                  "query": {
                     "match": {
                        "content": "special"
                     }
                  }
               }
            },
            {
               "has_parent": {
                  "type": "parent_doc",
                  "query": {
                     "match": {
                        "description": "special"
                     }
                  }
               }
            }
         ]
      }
   },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {},
            "content": {}
        }
    },
    "fields": ["_parent", "_source"]
}
...
{
   "took": 5,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 0.8866254,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 0.8866254,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.67829096,
            "_source": {
               "content": "text that is special"
            },
            "fields": {
               "_parent": "1"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.18709806,
            "_source": {
               "content": "different child text, but special"
            },
            "fields": {
               "_parent": "2"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "NiwsP2VEQBKjqu1M4AdjCg",
            "_score": 0.12531912,
            "_source": {
               "content": "text that is not"
            },
            "fields": {
               "_parent": "1"
            }
         },
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "2",
            "_score": 0.12531912,
            "_source": {
               "identifier": "second",
               "description": "some different text"
            }
         }
      ]
   }
}

(I added the "_parent" field to make it easier to see why each document was included in the results.)
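Because parents and children come back as separate hits, the client has to stitch the highlight fragments back together per parent. A minimal sketch of that step is below, assuming the response shape shown above (child hits carry `fields._parent` as a string, parent hits do not). `group_by_parent` and `SAMPLE` are hypothetical names, and `SAMPLE` is a trimmed copy of the response rather than real output.

```python
from collections import defaultdict


def group_by_parent(response):
    """Group highlight fragments under the id of the parent document."""
    grouped = defaultdict(lambda: {"parent": {}, "children": []})
    for hit in response["hits"]["hits"]:
        highlight = hit.get("highlight", {})
        parent_id = hit.get("fields", {}).get("_parent")
        if parent_id is None:
            # No _parent field: the hit is itself a parent document.
            grouped[hit["_id"]]["parent"] = highlight
        elif highlight:
            # Child hit: keep only children that actually have fragments.
            grouped[parent_id]["children"].append(highlight)
    return dict(grouped)


SAMPLE = {"hits": {"hits": [
    {"_type": "parent_doc", "_id": "1",
     "highlight": {"description": ["some <em>special</em> text"]}},
    {"_type": "child_doc", "_id": "a", "fields": {"_parent": "1"},
     "highlight": {"content": ["text that is <em>special</em>"]}},
    {"_type": "child_doc", "_id": "b", "fields": {"_parent": "2"},
     "highlight": {"content": ["different child text, but <em>special</em>"]}},
    {"_type": "child_doc", "_id": "c", "fields": {"_parent": "1"}},
    {"_type": "parent_doc", "_id": "2"},
]}}

for parent_id, parts in sorted(group_by_parent(SAMPLE).items()):
    print(parent_id, parts)
```

Children without fragments are dropped here so that each parent lists only the snippets worth displaying; keep them if the UI needs every child.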

Let me know if this helps.

Here is the code I used:

http://sense.qbox.io/gist/d69a4d6531dc063faa4b4e094cff2a472a73c5a6