Group by term and get a list of array keys with their counts

Date: 2015-03-23 14:21:45

Tags: elasticsearch

In Elasticsearch I have several hundred thousand documents with roughly this structure:

{
  "script": "/index.html",
  "query": {
    "ab": "hello",
    "cd": "world",
    "ef": "123"
  }
}

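The URL "http://localhost/index.html?ab=hello&cd=world&ef=123" was parsed into this document. "script" contains only the path of the target script, with no query string at all. The "query" objects do not all contain the same set of keys, and of course the values differ too, but the values do not matter for now.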
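For readers who want to reproduce the document shape, here is a minimal Python sketch of how such a URL might be split into "script" and "query". The question does not say how the parsing was actually done, so the function name url_to_doc and the choice of urllib.parse are assumptions made purely for illustration.

```python
from urllib.parse import urlsplit, parse_qsl


def url_to_doc(url):
    """Split a URL into the path ("script") and its query parameters ("query").

    Hypothetical helper: if a key is repeated in the query string,
    only its last value is kept, because the pairs are stored in a dict.
    """
    parts = urlsplit(url)
    return {"script": parts.path, "query": dict(parse_qsl(parts.query))}


doc = url_to_doc("http://localhost/index.html?ab=hello&cd=world&ef=123")
```

With the URL from the question, doc has the same "script" and "query" content as the example document above.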

I know that I can get a list of the unique "script" values with:

{
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "script.raw"
      }
    }
  }
}

which results in several buckets such as:

"buckets": [
{
    "key": "/index.html",
    "doc_count": 123456
},
{
    "key": "/hello.html",
    "doc_count": 1456
},
...

My question: is there a way to additionally get a list, with counts, of all the query keys that occur for each of the different URLs?

Something like:

"buckets": [
{
    "key": "/index.html",
    "doc_count": 123456,
    "query_key_count": {
      "ab": 33456,
      "cd": 3456,
      "ef": 456,
      "gh": 56,
      "ij": 6
    }
},
{
    "key": "/hello.html",
    "doc_count": 1456,
    "query_key_count": {
      "zy": 156,
      "gh": 6
    }
},
...

Thanks a lot!!

1 Answer:

Answer 0: (score: 0)

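To take advantage of what Elasticsearch does well, you really need to structure your documents like this, with each query parameter stored as a key/value object: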

{
   "script": "/index.html",
   "query": [
      {
         "query_key": "ab",
         "query_val": "hello"
      },
      {
         "query_key": "cd",
         "query_val": "world"
      },
      {
         "query_key": "ef",
         "query_val": "123"
      }
   ]
}
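If the existing documents have to be reindexed into this layout, the conversion itself is small. The following is only a sketch in Python; the helper name to_nested_doc is hypothetical, and it assumes each source document has the flat "query" object shown in the question.

```python
def to_nested_doc(doc):
    """Return a copy of doc whose "query" object is rewritten as a list of
    {"query_key": ..., "query_val": ...} entries (hypothetical helper)."""
    query = doc.get("query") or {}
    nested = [{"query_key": key, "query_val": val} for key, val in query.items()]
    return {**doc, "query": nested}


flat = {
    "script": "/index.html",
    "query": {"ab": "hello", "cd": "world", "ef": "123"},
}
converted = to_nested_doc(flat)
```

Here converted has the same "script" value and a three-element "query" list, matching the structure above.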

If I set up the mapping with a nested type:

PUT /test_index
{
   "mappings": {
      "doc": {
         "properties": {
            "query": {
               "type": "nested",
               "properties": {
                  "query_key": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "query_val": {
                     "type": "string",
                     "index": "not_analyzed"
                  }
               }
            },
            "script": {
               "type": "string",
               "index": "not_analyzed"
            }
         }
      }
   }
}

and add a couple of documents:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"script": "/index.html","query": [{"query_key":"ab", "query_val":"hello"},{"query_key":"cd", "query_val":"world"}, {"query_key":"ef", "query_val":"123"}]}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"script": "/index.html","query": [{"query_key":"ab", "query_val":"foo"},{"query_key":"cd", "query_val":"bar"}, {"query_key":"gh", "query_val":"456"}]}

I can get the query keys back with a terms aggregation inside a nested aggregation:

POST /test_index/_search?search_type=count
{
   "aggs": {
      "resellers": {
         "nested": {
            "path": "query"
         },
         "aggs": {
            "query_keys": {
               "terms": {
                  "field": "query.query_key"
               }
            }
         }
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "resellers": {
         "doc_count": 6,
         "query_keys": {
            "buckets": [
               {
                  "key": "ab",
                  "doc_count": 2
               },
               {
                  "key": "cd",
                  "doc_count": 2
               },
               {
                  "key": "ef",
                  "doc_count": 1
               },
               {
                  "key": "gh",
                  "doc_count": 1
               }
            ]
         }
      }
   }
}
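The request above counts keys across all documents, while the question asks for counts per script. One way this might be done is to put the nested aggregation under a terms aggregation on "script" (the field name from the mapping above; the question used "script.raw"). The sketch below, in Python, builds such a request body and reshapes a response into the "query_key_count" form from the question. The aggregation names "scripts", "query" and "query_keys", the helper query_key_counts, and the sample response are all assumptions written by hand for illustration; the sample was not captured from a cluster.

```python
# Request body: group by script, then count query keys inside each script bucket.
# The aggregation names are arbitrary labels chosen for this sketch.
request_body = {
    "aggs": {
        "scripts": {
            "terms": {"field": "script"},
            "aggs": {
                "query": {
                    "nested": {"path": "query"},
                    "aggs": {
                        "query_keys": {"terms": {"field": "query.query_key"}}
                    },
                }
            },
        }
    }
}


def query_key_counts(response):
    """Reshape the aggregation response into {script: {doc_count, query_key_count}}."""
    result = {}
    for bucket in response["aggregations"]["scripts"]["buckets"]:
        key_buckets = bucket["query"]["query_keys"]["buckets"]
        result[bucket["key"]] = {
            "doc_count": bucket["doc_count"],
            "query_key_count": {b["key"]: b["doc_count"] for b in key_buckets},
        }
    return result


# Hand-written sample shaped like the response expected for the two test documents.
sample_response = {
    "aggregations": {
        "scripts": {
            "buckets": [
                {
                    "key": "/index.html",
                    "doc_count": 2,
                    "query": {
                        "doc_count": 6,
                        "query_keys": {
                            "buckets": [
                                {"key": "ab", "doc_count": 2},
                                {"key": "cd", "doc_count": 2},
                                {"key": "ef", "doc_count": 1},
                                {"key": "gh", "doc_count": 1},
                            ]
                        },
                    },
                }
            ]
        }
    }
}

summary = query_key_counts(sample_response)
```

The request body would be sent to the same search endpoint as the request above. Note that the counts inside the nested aggregation are counts of nested "query" objects, and that a terms aggregation returns only its top buckets unless "size" is raised.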

Here is the code I used:

http://sense.qbox.io/gist/aecd92e5903f644e28c802860a90a86bdd7f97ee