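I'm trying to fetch the top 10 documents grouped by domain. These 10 documents need to have a "crawl_date" value that is either old (the page hasn't been crawled for a while) or missing (the page hasn't been crawled at all, e.g. a blank value). I have: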
curl -XPOST 'http://localhost:9200/tester/test/_search' -d '
{
"size": 10,
"aggs": {
"group_by_domain": {
"filter": {
"or":[
"term": {"crawl_date": ""},
"term": {"crawl_date": ""} // how do I put a range here? e.g. <= '2014-12-31'
]
},
"terms": {
"field": "domain"
}
}
}
}'
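I'm new to ES and I'm using version 2.2. Since the documentation isn't fully updated, I'm struggling.
EDIT: To clarify, I need 10 urls that haven't been crawled at all, or haven't been crawled in a while. Each of those 10 urls has to come from a unique domain, so that when I crawl them I don't overload someone's server.
Another EDIT: So, I need something like this (1 link for each of 10 unique domains):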
1. www.domain1.com/page
2. www.domain2.com/url
etc...
Instead, I only get the domains and the number of pages:
"buckets": [
{
"key": "http://www.dailymail.co.uk",
"doc_count": 212
},
{
"key": "https://sedo.com",
"doc_count": 196
},
{
"key": "http://www.foxnews.com",
"doc_count": 118
},
{
"key": "http://data.worldbank.org",
"doc_count": 117
},
{
"key": "http://detail.1688.com",
"doc_count": 117
},
{
"key": "https://twitter.com",
"doc_count": 112
},
{
"key": "http://search.rakuten.co.jp",
"doc_count": 104
},
{
"key": "https://in.1688.com",
"doc_count": 92
},
{
"key": "http://www.abc.net.au",
"doc_count": 87
},
{
"key": "http://sport.lemonde.fr",
"doc_count": 85
}
]
And the "hits" only contain multiple pages from 1 domain:
"hits": [
{
"_index": "tester",
"_type": "test",
"_id": "http://www.barnesandnoble.com/w/at-the-edge-of-the-orchard-tracy-chevalier/1121908441?ean=9780525953005",
"_score": 1,
"_source": {
"domain": "http://www.barnesandnoble.com",
"crawl_date": "0001-01-01T00:00:00Z"
}
},
{
"_index": "tester",
"_type": "test",
"_id": "http://www.barnesandnoble.com/b/bargain-books/_/N-8qb",
"_score": 1,
"_source": {
"domain": "http://www.barnesandnoble.com",
"crawl_date": "0001-01-01T00:00:00Z"
}
},
etc....
Barnes and Noble will quickly ban my UA if I try to crawl that many of their pages at the same time.
I need something like this:
1. "http://www.dailymail.co.uk/page/text.html",
2. "https://sedo.com/another/page"
3. "http://www.barnesandnoble.com/b/bargain-books/_/N-8qb"
4. "http://www.starbucks.com/homepage/"
etc.
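Answer 0: (score: 3)
If you want to use aggregations, I would suggest using the terms aggregation to remove duplicates from your result set, and as a sub-aggregation I would use the top_hits aggregation, which gives you the best hit among the aggregated docs of each domain (by default, the score of every doc within a domain should be the same).
So the query would look like this: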
POST sites/page/_search
{
"size": 0,
"aggs": {
"filtered_domains": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": {
"exists": {
"field": "crawl_date"
}
}
}
},
{
"range": {
"crawl_date": {
"lte": "2016-01-01"
}
}
}
]
}
},
"aggs": {
"domains": {
"terms": {
"field": "domain",
"size": 10
},
"aggs": {
"pages": {
"top_hits": {
"size": 1
}
}
}
}
}
}
}
}
This gives you a result like this:
"aggregations": {
"filtered_domains": {
"doc_count": 3,
"domains": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "barnesandnoble.com",
"doc_count": 2,
"pages": {
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "page",
"_id": "barnesandnoble.com/test2.html",
"_score": 1,
"_source": {
"crawl_date": "1982-05-16",
"domain": "barnesandnoble.com"
}
}
]
}
}
},
{
"key": "starbucks.com",
"doc_count": 1,
"pages": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "page",
"_id": "starbucks.com/index.html",
"_score": 1,
"_source": {
"crawl_date": "1982-05-16",
"domain": "starbucks.com"
}
}
]
}
}
}
]
}
}
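To turn that response into the flat list of urls the question asks for, something like the following Python sketch might work. It assumes the aggregation names from the query above (filtered_domains, domains, pages) and that the document _id is the page url; the function name and the trimmed sample response are only for illustration.

```python
# Sketch: pull one URL per domain out of a terms + top_hits response.
# Assumes the aggregation names "filtered_domains", "domains" and "pages"
# from the query above, and that each document's _id is the page URL.

def urls_from_aggregation(response):
    """Return the _id of the first top hit in each domain bucket."""
    buckets = response["aggregations"]["filtered_domains"]["domains"]["buckets"]
    urls = []
    for bucket in buckets:
        hits = bucket["pages"]["hits"]["hits"]
        if hits:  # stay defensive in case a bucket carries no hits
            urls.append(hits[0]["_id"])
    return urls


# Trimmed copy of the sample response shown above (illustrative only).
sample_response = {
    "aggregations": {
        "filtered_domains": {
            "doc_count": 3,
            "domains": {
                "buckets": [
                    {"key": "barnesandnoble.com", "doc_count": 2,
                     "pages": {"hits": {"total": 2, "hits": [
                         {"_id": "barnesandnoble.com/test2.html"}]}}},
                    {"key": "starbucks.com", "doc_count": 1,
                     "pages": {"hits": {"total": 1, "hits": [
                         {"_id": "starbucks.com/index.html"}]}}},
                ]
            },
        }
    }
}

print(urls_from_aggregation(sample_response))
```

Because the terms aggregation is limited to 10 buckets and top_hits to 1 document, the returned list holds at most 10 urls, each from a different domain.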
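If you are able to change your index structure, I would suggest creating an index with either a parent/child relationship or nested documents.
If you do so, you can select 10 distinct domains and retrieve one (or more) specific pages for each of them.
Let me show you an example with parent/child (if you use Sense, you should be able to just copy and paste it):
First, create the mapping for the documents: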
PUT /sites
{
"mappings": {
"domain": {},
"page": {
"_parent": {
"type": "domain"
},
"properties": {
"crawl_date": {
"type": "date"
}
}
}
}
}
Insert some documents:
PUT sites/domain/barnesandnoble.com
{}
PUT sites/domain/starbucks.com
{}
PUT sites/domain/dailymail.co.uk
{}
POST /sites/page/_bulk
{ "index": { "_id": "barnesandnoble.com/test.html", "parent": "barnesandnoble.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "barnesandnoble.com/test2.html", "parent": "barnesandnoble.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "starbucks.com/index.html", "parent": "starbucks.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "dailymail.co.uk/index.html", "parent": "dailymail.co.uk" }}
{}
Search for the urls to crawl:
POST /sites/domain/_search
{
"query": {
"has_child": {
"type": "page",
"query": {
"bool": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": {
"exists": {
"field": "crawl_date"
}
}
}
},
{
"range": {
"crawl_date": {
"lte": "2016-01-01"
}
}
}]
}
}
}
},
"inner_hits": {
"size": 1
}
}
}
}
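We run a has_child query against the parent type, so we only receive distinct domains of the parent type. To get the specific pages, we have to add inner_hits, which gives us the child documents that led to the hits on the parent type. If you set the inner_hits size to 1, you get only one page per domain. You can even add sorting to the inner_hits, for example by crawl_date. ;)
The search above gives you the following result: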
"hits": [
{
"_index": "sites",
"_type": "domain",
"_id": "starbucks.com",
"_score": 1,
"_source": {},
"inner_hits": {
"page": {
"hits": {
"total": 1,
"max_score": 1.9664046,
"hits": [
{
"_index": "sites",
"_type": "page",
"_id": "starbucks.com/index.html",
"_score": 1.9664046,
"_routing": "starbucks.com",
"_parent": "starbucks.com",
"_source": {
"crawl_date": "1982-05-16"
}
}
]
}
}
}
},
{
"_index": "sites",
"_type": "domain",
"_id": "dailymail.co.uk",
"_score": 1,
"_source": {},
"inner_hits": {
"page": {
"hits": {
"total": 1,
"max_score": 1.9664046,
"hits": [
{
"_index": "sites",
"_type": "page",
"_id": "dailymail.co.uk/index.html",
"_score": 1.9664046,
"_routing": "dailymail.co.uk",
"_parent": "dailymail.co.uk",
"_source": {}
}
]
}
}
}
},
{
"_index": "sites",
"_type": "domain",
"_id": "barnesandnoble.com",
"_score": 1,
"_source": {},
"inner_hits": {
"page": {
"hits": {
"total": 2,
"max_score": 1.4142135,
"hits": [
{
"_index": "sites",
"_type": "page",
"_id": "barnesandnoble.com/test.html",
"_score": 1.4142135,
"_routing": "barnesandnoble.com",
"_parent": "barnesandnoble.com",
"_source": {
"crawl_date": "1982-05-16"
}
}
]
}
}
}
}
]
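As a rough sketch of how the has_child search and its result could be handled from Python: the first function builds the same query body with the inner_hits sorted by crawl_date, and the second flattens a response into page urls. The function names, the pages_per_domain parameter and the trimmed sample response are hypothetical, and the "missing": "_first" sort option is an assumption about how you want never-crawled pages ordered.

```python
# Sketch: build the has_child query with sorted inner_hits, then flatten
# the response. Names and the sample data below are illustrative only.

def stale_pages_query(cutoff, pages_per_domain=1):
    """Body for POST /sites/domain/_search selecting domains with stale pages."""
    stale = {
        "bool": {
            "should": [
                # never crawled: crawl_date is absent
                {"bool": {"must_not": {"exists": {"field": "crawl_date"}}}},
                # crawled on or before the cutoff
                {"range": {"crawl_date": {"lte": cutoff}}},
            ]
        }
    }
    return {
        "query": {
            "has_child": {
                "type": "page",
                "query": {"bool": {"filter": stale}},
                "inner_hits": {
                    "size": pages_per_domain,
                    # Oldest crawl_date first. Assumption: "missing": "_first"
                    # puts pages without a crawl_date ahead of the rest.
                    "sort": [{"crawl_date": {"order": "asc", "missing": "_first"}}],
                },
            }
        }
    }


def pages_from_hits(response):
    """Collect the child page _ids from each parent hit's inner_hits."""
    urls = []
    for domain_hit in response["hits"]["hits"]:
        for page in domain_hit["inner_hits"]["page"]["hits"]["hits"]:
            urls.append(page["_id"])
    return urls


# Trimmed version of the result above, wrapped in the outer "hits" object.
sample_response = {"hits": {"hits": [
    {"_id": "starbucks.com", "inner_hits": {"page": {"hits": {"hits": [
        {"_id": "starbucks.com/index.html"}]}}}},
    {"_id": "dailymail.co.uk", "inner_hits": {"page": {"hits": {"hits": [
        {"_id": "dailymail.co.uk/index.html"}]}}}},
]}}

print(pages_from_hits(sample_response))
```

Adding "size": 10 at the top level of the body would cap the number of domains returned, which is already the default page size of a search.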
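Finally, let me note one thing: parent/child relationships come with a cost at query time. If that isn't a problem for your use case, I would go with this solution.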
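Answer 1: (score: 2)
I'd suggest using the exists filter instead of trying to match an empty term (the missing filter is deprecated in 2.2). Then, the range filter will help you filter out the documents you don't need.
Finally, since you have used the absolute url as the id, make sure to aggregate on the _uid field rather than the domain field; that way you get unique counts per exact page.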
curl -XPOST 'http://localhost:9200/tester/test/_search' -d '{
"size": 10,
"aggs": {
"group_by_domain": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": {
"exists": {
"field": "crawl_date"
}
}
}
},
{
"range": {
"crawl_date": {
"lte": "2014-12-31T00:00:00.000"
}
}
}
]
}
},
"aggs": {
"domains": {
"terms": {
"field": "_uid"
}
}
}
}
}
}'
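If the cutoff should move with time rather than stay fixed at 2014-12-31, the filter could be generated on the client side. Below is a minimal Python sketch under that assumption; cutoff_timestamp and stale_filter are hypothetical helper names, and the timestamp format simply mirrors the value used in the query above.

```python
from datetime import datetime, timedelta


def cutoff_timestamp(now, days):
    """Format the instant `days` days before `now` like the range value above."""
    return (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%S") + ".000"


def stale_filter(cutoff):
    """Match documents with no crawl_date, or one on or before `cutoff`."""
    return {
        "bool": {
            "should": [
                {"bool": {"must_not": {"exists": {"field": "crawl_date"}}}},
                {"range": {"crawl_date": {"lte": cutoff}}},
            ]
        }
    }


# Example: everything not crawled in the 30 days before 2015-01-30.
body_filter = stale_filter(cutoff_timestamp(datetime(2015, 1, 30), 30))
print(body_filter["bool"]["should"][1])
```

The resulting dict can be dropped into the "filter" part of the aggregation before the body is serialized to JSON.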
Answer 2: (score: 0)
You have to use a Filter Aggregation and then a sub-aggregation:
{
"size": 10,
"aggs": {
"filter_date": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": [
{
"exists": {
"field": "crawl_date"
}
}
]
}
},
{
"range": {
"crawl_date": {
"lte": "now-100d"
}
}
}
]
}
},
"aggs": {
"group_by_domain": {
"terms": {
"field": "domain"
}
}
}
}
}
}