Question

I have a collection of addresses. Let's simplify and say the only fields are postcode, city, street, streetnumber and name. I'd like to be able to suggest a list of streets when the user enters a postcode, a city and some query for the street.

For example, if the user, in an HTML form, enters:

postcode: 75010
city: Paris
street: rue des

I'd like to get a list of streets like

'rue des petites écuries'
'rue des messageries'
...
'rue du faubourg poissonnière'
...

that I could suggest to the user.

So, I'd like to obtain a list of unique values of the "street" field, sorted according to how well they match my query on the "street" field. I'd like to obtain the 10 best matching streets for this query.

A query returning documents would look like:

{
    "query": {
        "bool": {
            "must": [
                {"term": {"postcode": "75010"}},
                {"term": {"city": "Paris"}},
                {"match": {"street": "rue des"}}
            ]
        }
    }
}

But of course the same street would appear many times in the results, since each street can occur in many different addresses in the collection.

I tried to use the aggregation framework and added an "aggs" section:

{
    "query": {
        "bool": {
            "must": [
                {"term": {"postcode": "75010"}},
                {"term": {"city": "Paris"}},
                {"match": {"street": "rue des"}}
            ]
        }
    },
    "aggs": {
        "street_agg": {
            "terms": {
                "field": "street",
                "size": 10
            }
        }
    }
}

The problem is that the buckets are automatically sorted, not by score, but by the number of documents in each bucket.

I'd like to have the buckets sorted by the score of an arbitrary document picked in each bucket (yes, it's enough to get the score from a single document in a bucket since the score depends only on the content of the street field in my example).
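
To make the desired ordering concrete, here is a minimal sketch in plain Python, outside Elasticsearch. The `hits` list, its scores, and both function names are made up for illustration; only the grouping logic matters. It contrasts the ordering I get now (by document count) with the one I want (by score).

```python
from collections import Counter

# Hypothetical (street, score) pairs, one per matching document.
# The scores are invented; every document of a street has the same score.
hits = [
    ("rue des petites écuries", 2.1),
    ("rue du faubourg poissonnière", 0.7),
    ("rue des petites écuries", 2.1),
    ("rue des messageries", 2.4),
    ("rue du faubourg poissonnière", 0.7),
    ("rue du faubourg poissonnière", 0.7),
]


def streets_by_count(hits, size=10):
    """What the plain terms aggregation gives: unique streets by document count."""
    counts = Counter(street for street, _ in hits)
    return [street for street, _ in counts.most_common(size)]


def streets_by_score(hits, size=10):
    """What I want: unique streets ordered by the best score in each group."""
    best = {}
    for street, score in hits:
        best[street] = max(score, best.get(street, float("-inf")))
    return sorted(best, key=best.get, reverse=True)[:size]
```

With this sample data, `streets_by_count` puts 'rue du faubourg poissonnière' first because it has the most documents, while `streets_by_score` puts the better-matching 'rue des ...' streets first.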

How would you achieve that?

Answer 1

OK, so the solution was actually to be found in Elasticsearch aggregation order by top hit score, but only if you read Shadocko's comment there: Elasticsearch aggregation order by top hit score, which I hadn't.

So here is the solution for anyone interested, and for my future self:

{                                 
    'query': {
        'bool': {
            'must': [
                {'term': {'postcode': '75010'}},
                {'term': {'city': 'Paris'}},
                {'match': {'street.autocomplete': 'rue des'}}
            ]
         }
    },
    'aggs': {
        'street_agg': {
            'terms': {
                'field': 'street',
                'size': 10,
                'order': {
                    'max_score': 'desc'
                }
            },
            'aggs': {
                'max_score': {
                    'max': {'script': '_score'}
                }
            }
        }
    }
}

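It's not perfect, because it uses the max aggregation function, which means it does unnecessary computation (taking the score of a single document from each bucket would be enough). But there seems to be no "pick one" aggregation function, only min, max, avg and sum, so you have to do it this way. Well, I suppose computing a max isn't that expensive.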
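
For completeness, here is a minimal sketch of how the request body above could be built and how the street names could be read back from the result. The helper names and the `sample_response` data are hypothetical; the response shape assumed here is the usual one for a terms aggregation (`aggregations.<name>.buckets`, each bucket with a `key`). The `"size": 0` entry is my addition: it only suppresses the document hits, since just the buckets are needed.

```python
def build_street_query(postcode, city, street_query, size=10):
    """Build the request body from the answer: unique streets ordered by max score."""
    return {
        "size": 0,  # assumption: only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "must": [
                    {"term": {"postcode": postcode}},
                    {"term": {"city": city}},
                    {"match": {"street.autocomplete": street_query}},
                ]
            }
        },
        "aggs": {
            "street_agg": {
                "terms": {
                    "field": "street",
                    "size": size,
                    # Order buckets by the sub-aggregation defined below.
                    "order": {"max_score": "desc"},
                },
                "aggs": {"max_score": {"max": {"script": "_score"}}},
            }
        },
    }


def streets_from_response(response):
    """Return the bucket keys (street names) in the order Elasticsearch sent them."""
    buckets = response["aggregations"]["street_agg"]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Hypothetical response fragment; counts and scores are invented for illustration.
sample_response = {
    "aggregations": {
        "street_agg": {
            "buckets": [
                {"key": "rue des messageries", "doc_count": 12, "max_score": {"value": 2.4}},
                {"key": "rue des petites écuries", "doc_count": 30, "max_score": {"value": 2.1}},
                {"key": "rue du faubourg poissonnière", "doc_count": 85, "max_score": {"value": 0.7}},
            ]
        }
    }
}
```

The body returned by `build_street_query("75010", "Paris", "rue des")` can be sent with whatever client you use; `streets_from_response` then yields the suggestions, best match first.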

Elasticsearch: how to get the top unique values of a field sorted by matching score?
