ElasticSearch: aggregation on a URL field keeps splitting the field

Date: 2015-10-07 20:38:40

Tags: elasticsearch

I'm trying to write an Elasticsearch query that groups all blogs by their blog domain (wordpress.com, blog.com, etc.). This is what my query looks like:

{
    "engagements": [
        "blogs"
    ],
    "query": {
        "query": {
            "filtered": {
                "query": {
                    "match_all": {}
                },
                "filter": {
                    "bool": {
                        "must": [
                            {
                                "range": {
                                    "weight": {
                                        "gte": 120,
                                        "lte": 150
                                    }
                                }
                            }
                        ]
                    }
                }
            }
        },
        "facets": {
            "my_facet": {
                "terms": {
                    "field": "blog_domain" <-------------------------------------
                }
            }
        }
    },
    "api": "_search"
}

However, it returns:

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
            ...
        ]
    },
    "facets": {
        "my_facet": {
            "_type": "terms",
            "missing": 0,
            "total": 21,
            "other": 3,
            "terms": [
                {
                    "term": "http",
                    "count": 3
                },
                {
                    "term": "noblepig.com",
                    "count": 2
                },
                {
                    "term": "hawaiian",
                    "count": 2
                },
                {
                    "term": "dream",
                    "count": 2
                },
                {
                    "term": "dessert",
                    "count": 2
                },
                {
                    "term": "2015",
                    "count": 2
                },
                {
                    "term": "05",
                    "count": 2
                },
                {
                    "term": "www.bt",
                    "count": 1
                },
                {
                    "term": "photos",
                    "count": 1
                },
                {
                    "term": "images.net",
                    "count": 1
                }
            ]
        }
    }
}

This is not what I want. Right now my database has three records:

"http://www.bt-images.net/8-cute-photos-cats/", 

"http://noblepig.com/2015/05/hawaiian-dream-dessert/", 

"http://noblepig.com/2015/05/hawaiian-dream-dessert/"

I would like it to return something like this:

    "facets": {
        "my_facet": {
            "_type": "terms",
            "missing": 0,
            "total": 21,
            "other": 3,
            "terms": [
                {
                    "term": "http://noblepig.com/2015/05/hawaiian-dream-dessert/",
                    "count": 2
                },
                {
                    "term": "http://www.bt-images.net/8-cute-photos-cats/",
                    "count": 1
                },

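How can I do this? I looked around and saw people recommending mappings, but I don't know where to put that in this query, and my table already has 100 million records, so it is too late for that. If you have a suggestion, could you paste the whole query?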

The same thing happens when using aggs:

{
    "engagements": [
        "blogs"
    ],
    "query": {
        "query": {
            "filtered": {
                "query": {
                    "match_all": {}
                },
                "filter": {
                    "bool": {
                        "must": [
                            {
                                "range": {
                                    "weight": {
                                        "gte": 13,
                                        "lte": 75
                                    }
                                }
                            }
                        ]
                    }
                }
            }
        },
        "aggs": {
            "blah": {
                "terms": {
                    "field": "blog_domain"
                }
            }
        }
    },
    "api": "_search"
}

1 answer:

Answer 0: (score: 3)

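The correct way to do this is to give the field a different mapping. The terms you are getting back are the individual tokens of an analyzed string field, so you need a not_analyzed version of it. You can change the mapping on the fly by adding a sub-field to blog_domain, but you cannot change documents that are already indexed; the mapping change only takes effect for new documents.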

For reference, your blog_domain mapping should look like this:

    "blog_domain": {
      "type": "string",
      "fields": {
        "notAnalyzed": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
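
To apply this to an existing index, the fragment would go inside a put-mapping request. A minimal sketch, assuming an index named `blogs` and a type named `blog` (both hypothetical, substitute your own), sent as the body of `PUT /blogs/_mapping/blog` on a 1.x-era cluster that still uses the string type:

```json
{
  "blog": {
    "properties": {
      "blog_domain": {
        "type": "string",
        "fields": {
          "notAnalyzed": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
```

Documents that were indexed before this change will not have the sub-field populated until they are reindexed.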

This means the field should have a sub-field (called notAnalyzed in my example), and in your aggregation you should use blog_domain.notAnalyzed.
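
Since the question asked for the whole query, here is a sketch of the second query from the question with only the aggregation field changed. It assumes the sub-field is named `notAnalyzed` as in the mapping above, and it leaves the outer wrapper (`engagements`, `api`) exactly as the question has it:

```json
{
    "engagements": [
        "blogs"
    ],
    "query": {
        "query": {
            "filtered": {
                "query": {
                    "match_all": {}
                },
                "filter": {
                    "bool": {
                        "must": [
                            {
                                "range": {
                                    "weight": {
                                        "gte": 13,
                                        "lte": 75
                                    }
                                }
                            }
                        ]
                    }
                }
            }
        },
        "aggs": {
            "blah": {
                "terms": {
                    "field": "blog_domain.notAnalyzed"
                }
            }
        }
    },
    "api": "_search"
}
```

Only documents indexed after the mapping change would show up in these buckets, so with existing data the counts will look too low until a reindex.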

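However, if you don't want to or can't make this change, there is another way, though I think it is slower: use a script aggregation. Like this: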

{
  "aggs": {
    "blah": {
      "terms": {
        "script": "_source.blog_domain", 
        "size": 10
      }
    }
  }
}

If you haven't enabled it already, you will need to enable dynamic scripting.