Elasticsearch: filter parents by filtered child document count

Date: 2016-01-04 21:25:28

Tags: elasticsearch

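I'm trying to run some Elasticsearch queries against a set of data I have. I have a user document that is the parent of many child page_view documents. I want to return all users who have viewed a specific page some arbitrary number of times (the number comes from a user input box). So far I have a has_child query that returns all users who have page views with certain ids. However, this returns those parents regardless of how many matching children they have. Next, I tried writing an aggregation on those query results, which essentially performs the same has_child query in aggregation form. Now I have the correct document count for the filtered child documents. I need to use that document count to go back and filter the parents. To put the query in words: "return all users who have viewed a specific page more than 4 times." I may need to restructure my data. Any ideas?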

Here is my query so far:

curl -XGET 'http://localhost:9200/development_users/_search?pretty=true' -d '
{
    "query" : { 
      "has_child" : {
        "type" : "page_view",
        "query" : {
          "terms" : {
            "viewed_id" : [175,180]
          }
        }
      }
    },
    "aggs" : {
      "to_page_view": {
        "children": {
          "type" : "page_view"
        },
        "aggs" : {
          "page_views_that_match" : {
            "filter" : { "terms": { "viewed_id" : [175,180] } }
          }
        }
      }
    }
}'

This gives me the following response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "development_users",
      "_type" : "user",
      "_id" : "22548",
      "_score" : 1.0,
      "_source":{"id":22548,"account_id":1009}
    } ]
  },
  "aggregations" : {
    "to_page_view" : {
      "doc_count" : 53,
      "page_views_that_match" : {
        "doc_count" : 2
      }
    }
  }
}

Relevant mappings:

{
  "development_users" : {
    "mappings" : {
      "page_view" : {
        "dynamic" : "false",
        "_parent" : {
          "type" : "user"
        },
        "_routing" : {
          "required" : true
        },
        "properties" : {
          "created_at" : {
            "type" : "date",
            "format" : "date_time"
          },
          "id" : {
            "type" : "integer"
          },
          "viewed_id" : {
            "type" : "integer"
          },
          "time_on_page" : {
            "type" : "integer"
          },
          "title" : {
            "type" : "string"
          },
          "type" : {
            "type" : "string"
          },
          "updated_at" : {
            "type" : "date",
            "format" : "date_time"
          },
          "url" : {
            "type" : "string"
          }
        }
      },
      "user" : {
        "dynamic" : "false",
        "properties" : {
          "account_id" : {
            "type" : "integer"
          },
          "id" : {
            "type" : "integer"
          }
        }
      }
    }
  }
}

1 answer:

Answer 0 (score: 5)

OK, so this one is kind of involved. I made some simplifications to keep it straight in my head. First, I used this mapping:

PUT /test_index
{
    "mappings": {
        "page_view": {
            "_parent": {
               "type": "development_user"
            },
            "properties": {
                "viewed_id": {
                    "type": "string"
                }
            }
        },
        "development_user": {
            "properties": {
                "id": {
                    "type": "string"
                }
            }
        }
    }
}

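Then I added some data. In this little universe I have three users and two pages. I want to find the users who have viewed "page_a" at least twice, so a correctly built query should return only user 3.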

POST /test_index/development_user/_bulk
{"index":{"_type":"development_user","_id":1}}
{"id":"user_1"}
{"index":{"_type":"page_view","_parent":1}}
{"viewed_id":"page_a"}
{"index":{"_type":"development_user","_id":2}}
{"id":"user_2"}
{"index":{"_type":"page_view","_parent":2}}
{"viewed_id":"page_b"}
{"index":{"_type":"development_user","_id":3}}
{"id":"user_3"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_b"}
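
As a quick sanity check of the expected answer, the same counting can be done in plain Python over the sample data. This is only a sketch of the logic the query needs to reproduce, not part of the Elasticsearch solution; the `page_views` list and the `parents_with_min_views` helper are illustrative names, not something from the original setup.

```python
from collections import Counter

# Child documents from the bulk request above, as (parent_id, viewed_id) pairs.
page_views = [
    (1, "page_a"),
    (2, "page_b"),
    (3, "page_a"),
    (3, "page_a"),
    (3, "page_b"),
]


def parents_with_min_views(views, page_ids, min_count):
    """Return sorted parent ids having at least min_count views of the given pages."""
    wanted = set(page_ids)
    # Count only the child documents whose viewed_id is one of the wanted pages.
    counts = Counter(parent for parent, viewed in views if viewed in wanted)
    return sorted(parent for parent, n in counts.items() if n >= min_count)


print(parents_with_min_views(page_views, ["page_a"], 2))
```

With a threshold of 1 instead of 2, user 1 would qualify as well, which is exactly the set of parents a bare has_child query matches.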

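To get that answer we'll use aggregations. Note that I don't want the documents returned as hits (the normal way), but I do want to filter down the documents being analyzed, since that makes things more efficient. So I'm using the same basic has_child filter you had before.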

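So the aggregation tree starts with terms_parent_id, which just separates out the parent documents, one bucket per parent. Inside it, children_page_view (with its nested filter_page_ids filter) narrows the child documents down to the ones I want ("page_a"). Next to it in the hierarchy, bucket_selector_page_id_term_count uses a bucket selector (you'll need ES 2.x for this) to keep only the parent buckets that meet the criterion. Finally, a top hits aggregation shows us the documents that meet the requirements.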

POST /test_index/development_user/_search
{
   "size": 0,
   "query": {
      "has_child": {
         "type": "page_view",
         "query": {
            "terms": {
               "viewed_id": [
                  "page_a"
               ]
            }
         }
      }
   },
   "aggs": {
      "terms_parent_id": {
         "terms": {
            "field": "id"
         },
         "aggs": {
            "children_page_view": {
               "children": {
                  "type": "page_view"
               },
               "aggs": {
                  "filter_page_ids": {
                     "filter": {
                        "terms": {
                           "viewed_id": [
                              "page_a"
                           ]
                        }
                     }
                  }
               }
            },
            "bucket_selector_page_id_term_count": {
               "bucket_selector": {
                  "buckets_path": {
                     "children_count": "children_page_view>filter_page_ids._count"
                  },
                  "script": "children_count >= 2"
               }
            },
            "top_hits_users": {
               "top_hits": {
                  "_source": {
                     "include": [
                        "id"
                     ]
                  }
               }
            }
         }
      }
   }
}
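
Since the page ids and the threshold come from user input in the question, the request body above could be generated rather than written by hand. The following is a minimal sketch in Python that builds the same body as a dictionary; the function name `build_parent_count_query` and its parameters are hypothetical, and the script string simply follows the form used in the query above.

```python
import json


def build_parent_count_query(page_ids, min_count):
    """Build the search body: parents with at least min_count matching page_view children."""
    # The same terms filter is used in the has_child query and in the child filter aggregation.
    child_filter = {"terms": {"viewed_id": list(page_ids)}}
    return {
        "size": 0,
        "query": {"has_child": {"type": "page_view", "query": child_filter}},
        "aggs": {
            "terms_parent_id": {
                "terms": {"field": "id"},
                "aggs": {
                    "children_page_view": {
                        "children": {"type": "page_view"},
                        "aggs": {"filter_page_ids": {"filter": child_filter}},
                    },
                    "bucket_selector_page_id_term_count": {
                        "bucket_selector": {
                            "buckets_path": {
                                "children_count": "children_page_view>filter_page_ids._count"
                            },
                            # int() keeps arbitrary user input out of the script string.
                            "script": "children_count >= %d" % int(min_count),
                        }
                    },
                    "top_hits_users": {"top_hits": {"_source": {"include": ["id"]}}},
                },
            }
        },
    }


print(json.dumps(build_parent_count_query(["page_a"], 2), indent=3))
```

One thing to keep in mind: a terms aggregation returns only its top buckets (10 by default), so with many users the terms_parent_id aggregation would likely need a larger size to see every parent.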

which returns:

{
   "took": 14,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "terms_parent_id": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "user_3",
               "doc_count": 1,
               "children_page_view": {
                  "doc_count": 3,
                  "filter_page_ids": {
                     "doc_count": 2
                  }
               },
               "top_hits_users": {
                  "hits": {
                     "total": 1,
                     "max_score": 1,
                     "hits": [
                        {
                           "_index": "test_index",
                           "_type": "development_user",
                           "_id": "3",
                           "_score": 1,
                           "_source": {
                              "id": "user_3"
                           }
                        }
                     ]
                  }
               }
            }
         ]
      }
   }
}
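
Because the matching users come back inside aggregation buckets rather than in the top-level hits, a client has to dig them out of the response. Here is a minimal sketch in Python, assuming the response has already been parsed into a dictionary; `matching_parent_ids` is a hypothetical helper, and `sample_response` is a trimmed copy of the response above that keeps only the fields the helper reads.

```python
def matching_parent_ids(response):
    """Collect the _id of every parent document that survived the bucket selector."""
    buckets = response["aggregations"]["terms_parent_id"]["buckets"]
    ids = []
    for bucket in buckets:
        # Each surviving bucket carries its parent document(s) in the top_hits result.
        for hit in bucket["top_hits_users"]["hits"]["hits"]:
            ids.append(hit["_id"])
    return ids


# Trimmed version of the response shown above.
sample_response = {
    "aggregations": {
        "terms_parent_id": {
            "buckets": [
                {
                    "key": "user_3",
                    "doc_count": 1,
                    "children_page_view": {
                        "doc_count": 3,
                        "filter_page_ids": {"doc_count": 2},
                    },
                    "top_hits_users": {
                        "hits": {
                            "total": 1,
                            "hits": [
                                {
                                    "_index": "test_index",
                                    "_type": "development_user",
                                    "_id": "3",
                                    "_source": {"id": "user_3"},
                                }
                            ],
                        }
                    },
                }
            ]
        }
    }
}

print(matching_parent_ids(sample_response))
```

Note that the top-level hits total of 2 in the response still counts both users matched by has_child; only the buckets reflect the count threshold.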

Here is all the code I used:

http://sense.qbox.io/gist/43f24461448519dc884039db40ebd8e2f5b7304f