Suppose I use the keyword tokenizer and the lowercase filter, so that my my_name field value "It is a Nike shoe." is tokenized into a single term: ["it is a nike shoe."].
With this aggregation query:
{
"size": 0,
"aggs" : {
"my_aggs" : {
"terms" : { "field" : "my_name" }
}
}}
it returns:
"aggregations" : {
"my_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "it is a nike shoe.",
"doc_count" : 1
} ]
}
}
So I think the aggregation operates on terms, which means I cannot get back the original my_name field value "It is a Nike shoe.".
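To make the behaviour described above concrete, here is a minimal Python sketch that only imitates what the keyword tokenizer, the lowercase filter and a terms aggregation do. The names `analyze` and `terms_agg` are made up for illustration; they are not Elasticsearch APIs, and real analysis and bucketing happen inside the cluster.

```python
from collections import Counter


def analyze(text):
    # keyword tokenizer: the whole field value becomes a single token;
    # lowercase filter: that token is then lowercased.
    return [text.lower()]


def terms_agg(values):
    # Bucket documents by analyzed term, counting each term once per document.
    counts = Counter()
    for value in values:
        counts.update(set(analyze(value)))
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]


docs = ["It is a Nike shoe."]
buckets = terms_agg(docs)
print(buckets)
```

The bucket key is the lowercased term, not the stored field value, which matches the response shown above.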
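My questions are:
1) Are aggregations performed only on terms, i.e. do they always aggregate by term?
2) In my case, is it possible to get the original my_name field with an aggregation (I want the unique my_name values, not the terms produced by tokenization)?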
Answer 0 (score: 1)
1) Yes, aggregations are performed only on the terms stored in the inverted index.
2) Yes, you can get the original value with a top hits aggregation nested under the terms aggregation.
Does this help?
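As a rough illustration of the second point, the sketch below builds a request body that nests a top_hits aggregation under a terms aggregation on my_name. The helper name `build_terms_with_top_hits` and the sub-aggregation name `original_values` are assumptions made for this example, and the `"_source": {"include": ...}` spelling follows the legacy syntax used elsewhere in this thread; other Elasticsearch versions may spell it differently.

```python
import json


def build_terms_with_top_hits(field, hits_per_bucket=1):
    # Hypothetical helper: returns a search body, it does not talk to a cluster.
    return {
        "size": 0,  # skip the regular hits, only aggregations are wanted
        "aggs": {
            "my_aggs": {
                "terms": {"field": field},
                "aggs": {
                    "original_values": {
                        "top_hits": {
                            # number of hits returned inside each term bucket
                            "size": hits_per_bucket,
                            # keep only the field we care about in "_source"
                            "_source": {"include": field},
                        }
                    }
                },
            }
        },
    }


body = build_terms_with_top_hits("my_name")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request; each bucket in the response then carries the stored `_source` of one matching document next to the lowercased key.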
Answer 1 (score: 0)
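A top hits aggregation can carry more information per bucket, so it can include the original name. Aggregations can also be combined with a bool query, which makes this approach quite flexible.
Here is an example that groups by "brandName":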
curl -XGET 'my-elasticsearch.com/test-aggs/_search?pretty=true' -d '{
"size": 0,
"query" : {
"bool" : {
"must" : [ {
"match" : {
"state" : {
"query" : ["Active"],
"type" : "boolean"
}
}
}, {
"match" : {
"brandName" : {
"query" : "nik",
"type" : "phrase_prefix"
}
}
} ]
}
},
"aggs": {
"my_aggs": {
"terms": {
"field": "brandName"
},
"aggs": {
"my_top_hits": {
"top_hits": {
"size": 1, // number of matching hits returned per term (each hit includes "_source")
"_source": {
"include": "brandName"
}
}
}
}
}
}
}'
Sample output:
{
"took" : 37,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 10,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"my_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "nike inc. international company",
"doc_count" : 6,
"my_top_hits" : {
"hits" : {
"total" : 6,
"max_score" : 1.1467223,
"hits" : [ {
"_index" : "test-agg",
"_type" : "Brand",
"_id" : "AVFMf2jW9vvU7GxqHawa",
"_score" : 1.1467223,
"_source":{"brandName":"Nike Inc. International company"}
} ]
}
}
}, {
"key" : "nike company",
"doc_count" : 3,
"my_top_hits" : {
"hits" : {
"total" : 3,
"max_score" : 1.5016319,
"hits" : [ {
"_index" : "test-agg",
"_type" : "Brand",
"_id" : "AVFMjXOl9kfxoaJKgdxV",
"_score" : 1.5016319,
"_source":{"brandName":"NIKE Company"}
} ]
}
}
}, {
"key" : "nikee...",
"doc_count" : 1,
"my_top_hits" : {
"hits" : {
"total" : 1,
"max_score" : 1.6866593,
"hits" : [ {
"_index" : "test-agg",
"_type" : "Brand",
"_id" : "AVFMjaXi9vvU7GxqHawe",
"_score" : 1.6866593,
"_source":{"brandName":"NIKEE..."}
} ]
}
}
} ]
}
}
}
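The original "brandName" is returned in each bucket's "_source".
There is a significant limitation, though:
For the term "nike", several different "brandName" values may be tokenized to "nike", e.g. ["NIKE", "NIKE", "Nike", "nike", "NIKE", "Nike", "Nike"]. There is then no good value for "size" in "top_hits", because we do not know in advance how many "brandName" values map to the term "nike". In that case the aggregation has no advantage over returning all results and extracting the unique values in your own program.
Another drawback is performance: aggregations are much slower than match/term queries.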
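For readers who want to pull the original values out of such a response on the client side, a minimal sketch follows. The function name `original_values` is hypothetical, and the `sample` dictionary is a trimmed copy of the output above, not a real client response object.

```python
def original_values(response, agg_name="my_aggs", hits_name="my_top_hits",
                    field="brandName"):
    # Map each bucket key (the analyzed term) to the original field values
    # found in the "_source" of that bucket's top hits.
    result = {}
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        result[bucket["key"]] = [hit["_source"][field] for hit in hits]
    return result


# Trimmed version of the sample output shown above.
sample = {
    "aggregations": {
        "my_aggs": {
            "buckets": [
                {"key": "nike company", "doc_count": 3,
                 "my_top_hits": {"hits": {"hits": [
                     {"_source": {"brandName": "NIKE Company"}}]}}},
                {"key": "nikee...", "doc_count": 1,
                 "my_top_hits": {"hits": {"hits": [
                     {"_source": {"brandName": "NIKEE..."}}]}}},
            ]
        }
    }
}

extracted = original_values(sample)
print(extracted)
```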
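The first limitation can be illustrated with a small client-side sketch: several spellings collapse into one term, so a fixed "size" in "top_hits" cannot be chosen in advance, whereas grouping the returned values in your own code does not need one. The function name `distinct_originals_by_term` is made up for this example, and lowercasing stands in for whatever analyzer the field really uses.

```python
from collections import defaultdict


def distinct_originals_by_term(names):
    # Group original values by their (assumed) analyzed form, keeping
    # each distinct spelling once.
    groups = defaultdict(set)
    for name in names:
        groups[name.lower()].add(name)
    return {term: sorted(originals) for term, originals in groups.items()}


names = ["NIKE", "NIKE", "Nike", "nike", "NIKE", "Nike", "Nike"]
grouped = distinct_originals_by_term(names)
print(grouped)
```

Here the single term "nike" hides three distinct original spellings, so "size": 1 would surface only one of them.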