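I want to run an aggregation on a URI field, but return only the domain part of the URL rather than the full URL. For example, for the field value https://stackoverflow.com/questions/ask?guided=true I would get stackoverflow.com.
Given an existing dataset like the following: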
"hits" : [
  {
    "_index" : "people",
    "_type" : "_doc",
    "_id" : "L9WewGoBZqCeOmbRIMlV",
    "_score" : 1.0,
    "_source" : {
      "firstName" : "George",
      "lastName" : "Ouma",
      "pageUri" : "http://www.espnfc.com/story/683732/england-football-team-escaped-terrorist-attack-at-1998-world-cup",
      "date" : "2019-05-16T12:29:08.1308177Z"
    }
  },
  {
    "_index" : "people",
    "_type" : "_doc",
    "_id" : "MNWewGoBZqCeOmbRIsma",
    "_score" : 1.0,
    "_source" : {
      "firstName" : "George",
      "lastName" : "Ouma",
      "pageUri" : "http://www.wikipedia.org/wiki/Category:Terrorism_in_Mexico",
      "date" : "2019-05-16T12:29:08.1308803Z"
    }
  },
  {
    "_index" : "people",
    "_type" : "_doc",
    "_id" : "2V-ewGoBiHg_1GebJKIr",
    "_score" : 1.0,
    "_source" : {
      "firstName" : "George",
      "lastName" : "Ouma",
      "pageUri" : "http://www.wikipedia.com/story/683732/england-football-team-escaped-terrorist-attack-at-1998-world-cup",
      "date" : "2019-05-16T12:29:08.1308811Z"
    }
  }
]
My buckets should look like this:
"buckets" : [
  {
    "key" : "www.espnfc.com",
    "doc_count" : 1
  },
  {
    "key" : "www.wikipedia.com",
    "doc_count" : 2
  }
]
I have the following snippet that performs the aggregation; however, it aggregates on the full URL rather than on the domain name:
var searchResponse = client.Search<Person>(s => s
    .Size(0)
    .Query(q => q
        .MatchAll()
    )
    .Aggregations(a => a
        .Terms("visited_pages", ta => ta
            .Field(f => f.PageUri.Suffix("keyword"))
        )
    )
);
var aggregations = searchResponse.Aggregations.Terms("visited_pages");
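Any help would be appreciated :)
Answer 0 (score: 1)
I would suggest splitting that data out into a separate field (e.g. "topleveldomain") at ingest time; otherwise Elasticsearch has to do a lot of work on every document in order to perform the aggregation.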
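A minimal sketch of that idea, assuming the host is computed in application code before the document is sent to Elasticsearch. The class and method names (HostExtractor, extractHost) are hypothetical and not part of any Elasticsearch client, and returning an empty string for input that cannot be parsed is an assumed convention. The result would be stored in its own keyword field so that a plain terms aggregation can bucket on it.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: derive the host from a URL before the document is indexed.
final class HostExtractor {
    private HostExtractor() {}

    // Returns the host part of the URL, or "" when the input is null,
    // cannot be parsed, or carries no host (an assumed convention).
    static String extractHost(String url) {
        if (url == null) {
            return "";
        }
        try {
            String host = new URI(url).getHost();
            return host == null ? "" : host;
        } catch (URISyntaxException e) {
            return "";
        }
    }
}
```

For the example in the question, extractHost("https://stackoverflow.com/questions/ask?guided=true") yields stackoverflow.com. The "www." prefix is kept as-is, which matches the bucket keys the question asks for.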
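Answer 1 (score: 1)
I used the Terms Aggregation using Script shown below.
Note that I came up with the string logic by looking at your data. Please test it and adjust the logic to whatever you are looking for.
The best approach would be to have a separate field called hostname that holds the value you are looking for, and to apply the aggregation on that field.
However, if you are stuck, I think the aggregation below can help!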
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "my_unique_urls": {
      "terms": {
        "script" : {
          "inline": """
            String st = doc['pageUri.keyword'].value;
            if(st==null){
              return "";
            } else {
              return st.substring(0, st.lastIndexOf(".")+4);
            }
          """,
          "lang": "painless"
        }
      }
    }
  }
}
Here is what my response looks like:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "my_unique_urls": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "http://www.espnfc.com",
          "doc_count": 1
        },
        {
          "key": "http://www.wikipedia.org",
          "doc_count": 1
        },
        {
          "key": "https://en.wikipedia.org",
          "doc_count": 1
        }
      ]
    }
  }
}
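Two caveats about the script: the keys still include the scheme (http://), and cutting at lastIndexOf(".")+4 assumes that the last dot in the URL belongs to a three-letter top-level domain, so a path such as /page.html would be cut in the wrong place. Below is a rough sketch of alternative string logic, written as plain Java so it can be tried outside Elasticsearch. The names HostFromString and hostOf are hypothetical. It only uses String methods that I would expect Painless to accept, but I have not tested it as a Painless script, and it makes no attempt to strip ports or user info.

```java
// Hypothetical helper: string-only host extraction that does not depend on the TLD length.
final class HostFromString {
    private HostFromString() {}

    static String hostOf(String url) {
        if (url == null) {
            return "";
        }
        // Skip past the scheme separator ("://") when one is present.
        int schemeEnd = url.indexOf("://");
        int start = schemeEnd >= 0 ? schemeEnd + 3 : 0;
        // The host ends at the first '/', '?' or '#' after the scheme,
        // or at the end of the string when none of them occurs.
        int end = url.length();
        for (int i = start; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                end = i;
                break;
            }
        }
        return url.substring(start, end);
    }
}
```

With this logic, the first document in the question would land in a bucket keyed www.espnfc.com rather than http://www.espnfc.com.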
Hope this helps!