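Elasticsearch - Do aggregations operate only on terms?

Time: 2015-11-28 01:07:38

Tags: elasticsearch

Suppose I use the keyword tokenizer and the lowercase filter, so the value of my my_name field, "It is a Nike shoe.", is indexed as a single term: ["it is a nike shoe."]

Using this aggregation query: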

{
  "size": 0,
  "aggs": {
    "my_aggs": {
      "terms": { "field": "my_name" }
    }
  }
}

returns

"aggregations" : {
    "my_aggs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "it is a nike shoe.",
        "doc_count" : 1
      } ]
    }
  }

So I think aggregations operate on terms. This means I cannot get back the original my_name value, "It is a Nike shoe.".
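The behavior described above can be imitated in a few lines of plain Python. This is only a rough sketch, not Elasticsearch code: `analyze` and `terms_aggregation` are hypothetical helpers that mimic a keyword tokenizer plus lowercase filter and a per-term document count.

```python
from collections import Counter


def analyze(text):
    """Mimic a keyword tokenizer followed by a lowercase filter:
    the whole input becomes one lowercased term."""
    return [text.lower()]


def terms_aggregation(docs, field):
    """Count documents per indexed term, loosely like a terms aggregation."""
    counts = Counter()
    for doc in docs:
        # A document is counted once for each distinct term it produced.
        for term in set(analyze(doc[field])):
            counts[term] += 1
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]


docs = [{"my_name": "It is a Nike shoe."}]
buckets = terms_aggregation(docs, "my_name")
print(buckets)  # [{'key': 'it is a nike shoe.', 'doc_count': 1}]
```

The bucket key is the lowercased term, so the original casing is not visible in the bucket itself.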

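My questions are:

  1. Do aggregations operate only on terms, i.e. is the aggregation done per term?

  2. In my case, is it possible to get the original my_name field with an aggregation (I want the unique my_name values, not the terms produced by analysis)?

2 answers:

Answer 0: (score: 1)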

1) Yes, aggregations operate only on the terms stored in the inverted index.

2) Yes, you can use the top hits aggregation to get the original value.
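A minimal sketch of what such a request body might look like for the my_name field from the question. The helper name `build_top_hits_request` and its parameters are made up for illustration, and the `"_source": {"include": ...}` form follows the syntax used elsewhere in this thread; newer Elasticsearch versions may expect a different spelling.

```python
import json


def build_top_hits_request(field, bucket_size=10, hits_per_bucket=1):
    """Build a search body: a terms aggregation on `field` with a
    top_hits sub-aggregation that returns the stored _source."""
    return {
        "size": 0,  # skip the regular search hits, keep only aggregations
        "aggs": {
            "my_aggs": {
                "terms": {"field": field, "size": bucket_size},
                "aggs": {
                    "my_top_hits": {
                        "top_hits": {
                            # number of hits returned inside each bucket
                            "size": hits_per_bucket,
                            # assumption: same _source filter syntax as in this thread
                            "_source": {"include": field},
                        }
                    }
                },
            }
        },
    }


body = build_top_hits_request("my_name")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request, for example with curl.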

Does this help?

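Answer 1: (score: 0)

The top hits aggregation can carry more information, so we can include the original name. Aggregations can also be combined with a bool query, which makes them very flexible.

Here is an example that groups by "brandName":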

curl -XGET 'my-elasticsearch.com/test-aggs/_search?pretty=true' -d '{
  "size": 0,
  "query" : {
    "bool" : {
      "must" : [ {
        "match" : {
          "state" : {
            "query" : ["Active"],
            "type" : "boolean"
          }
        }
      }, {
        "match" : {
          "brandName" : {
            "query" : "nik",
            "type" : "phrase_prefix"
          }
        }
      } ]
    }
  },
  "aggs": {
    "my_aggs": {
      "terms": {
        "field": "brandName"
      },
      "aggs": {
        "my_top_hits": {
          "top_hits": {
            "size": 1, // for each term, how many matching hits are returned ("_source" is included in each hit)
            "_source": {
              "include": "brandName"
            }
          }
        }
      }
    }
  }
}'

Sample output:

{
  "took" : 37,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 10,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_aggs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "nike inc. international company",
        "doc_count" : 6,
        "my_top_hits" : {
          "hits" : {
            "total" : 6,
            "max_score" : 1.1467223,
            "hits" : [ {
              "_index" : "test-agg",
              "_type" : "Brand",
              "_id" : "AVFMf2jW9vvU7GxqHawa",
              "_score" : 1.1467223,
              "_source":{"brandName":"Nike Inc. International company"}
            } ]
          }
        }
      }, {
        "key" : "nike company",
        "doc_count" : 3,
        "my_top_hits" : {
          "hits" : {
            "total" : 3,
            "max_score" : 1.5016319,
            "hits" : [ {
              "_index" : "test-agg",
              "_type" : "Brand",
              "_id" : "AVFMjXOl9kfxoaJKgdxV",
              "_score" : 1.5016319,
              "_source":{"brandName":"NIKE Company"}
            } ]
          }
        }
      }, {
        "key" : "nikee...",
        "doc_count" : 1,
        "my_top_hits" : {
          "hits" : {
            "total" : 1,
            "max_score" : 1.6866593,
            "hits" : [ {
              "_index" : "test-agg",
              "_type" : "Brand",
              "_id" : "AVFMjaXi9vvU7GxqHawe",
              "_score" : 1.6866593,
              "_source":{"brandName":"NIKEE..."}
            } ]
          }
        }
      } ]
    }
  }
}

The original "brandName" is returned.
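As a rough illustration of reading such a response on the client side, the sketch below maps each bucket key to the original values found in its top hits. The function name `original_values` is hypothetical, and the `response` dict is a trimmed copy of the sample output above (one bucket omitted, unrelated fields dropped).

```python
def original_values(response, agg_name="my_aggs", hits_name="my_top_hits",
                    field="brandName"):
    """Map each bucket key (the indexed term) to the original field
    values found in that bucket's top hits."""
    result = {}
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        result[bucket["key"]] = [hit["_source"][field] for hit in hits]
    return result


# Trimmed copy of the sample output shown above.
response = {
    "aggregations": {
        "my_aggs": {
            "buckets": [
                {
                    "key": "nike inc. international company",
                    "doc_count": 6,
                    "my_top_hits": {"hits": {"total": 6, "hits": [
                        {"_source": {"brandName": "Nike Inc. International company"}}
                    ]}},
                },
                {
                    "key": "nike company",
                    "doc_count": 3,
                    "my_top_hits": {"hits": {"total": 3, "hits": [
                        {"_source": {"brandName": "NIKE Company"}}
                    ]}},
                },
            ]
        }
    }
}

for term, originals in original_values(response).items():
    print(term, "->", originals)
```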

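There is a significant limitation, though:

For the term "nike", several different "brandName" values may be indexed as "nike", e.g. ["NIKE", "NIKE", "Nike", "nike", "NIKE", "Nike", "Nike"]. This means there is no good way to choose the "size" in "top_hits", because we don't know how many "brandName" values map to the term "nike". (So this offers no advantage over returning all results and extracting the unique records in your own code.)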
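The client-side alternative mentioned above could look roughly like this. It is a sketch under the assumption that the analysis chain is equivalent to lowercasing the whole value; `unique_originals_by_term` is a hypothetical helper, not part of any Elasticsearch client.

```python
from collections import defaultdict


def unique_originals_by_term(values):
    """Group original values by their lowercased term, keeping each
    distinct original once, in first-seen order."""
    grouped = defaultdict(list)
    for value in values:
        term = value.lower()  # assumption: analysis is just lowercasing
        if value not in grouped[term]:
            grouped[term].append(value)
    return dict(grouped)


brand_names = ["NIKE", "NIKE", "Nike", "nike", "NIKE", "Nike", "Nike"]
print(unique_originals_by_term(brand_names))
# {'nike': ['NIKE', 'Nike', 'nike']}
```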

Another drawback is that aggregations are not high-performance; they are much slower than match/term queries.