How do I sort an aggregation whose keys contain international characters?

Date: 2017-10-02 06:59:36

Tags: elasticsearch unicode unicode-string elasticsearch-aggregation

Given a database containing a list of people, where they live, and their wealth/income/tax levels, I have given my Elasticsearch 5.6.2 this mapping:

mappings => {
    person => {
        properties => {
            name => {
                type   => 'text',
                fields => {
                    raw => {
                        type => 'keyword',
                    },
                },
            },

            county => {
                type   => 'text',
                fields => {
                    raw => {
                        type => 'keyword',
                    },
                },
            },

            community_name => {
                type   => 'text',
                fields => {
                    raw => {
                        type => 'keyword',
                    },
                },
            },

            wealth => {
                type => 'long',
            },

            income => {
                type => 'long',
            },

            tax => {
                type => 'long',
            },
        },
    },
},

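A county can have several communities, and I want to aggregate so that I get an overview of the average wealth/income/tax for each county and for each community within each county.

This seems to work: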

aggs => {
    counties => {
        terms => {
            field => 'county.raw',
            size  => 100,
            order => { _term => 'asc' },
        },

        aggs => {
            communities => {
                terms => {
                    field => 'community_name.raw',
                    size  => 1_000,
                    order => { _term => 'asc' },
                },

                aggs => {
                    avg_wealth => {
                        avg => {
                            field => 'wealth',
                        },
                    },

                    avg_income => {
                        avg => {
                            field => 'income',
                        },
                    },

                    avg_tax => {
                        avg => {
                            field => 'tax',
                        },
                    },
                },

            },

            avg_wealth => {
                avg => {
                    field => 'wealth',
                },
            },

            avg_income => {
                avg => {
                    field => 'income',
                },
            },

            avg_tax => {
                avg => {
                    field => 'tax',
                },
            },

        },

    },
},

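However, "county" and "community_name" are not sorted correctly, because some of them contain Norwegian characters: ES sorts "Ål" before "Øvre Eiker", which is wrong (in the Norwegian alphabet, Å comes after Ø).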
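
As far as I can tell, the mismatch comes from a plain keyword field sorting by encoded character values rather than by the Norwegian alphabet: Å (U+00C5) has a lower code point than Ø (U+00D8). A minimal Python sketch of the difference; `NORWEGIAN_ALPHABET` and `norwegian_key` are names made up for this illustration and are a simplified stand-in for real ICU collation, not something Elasticsearch provides:

```python
# Illustration only: code point order versus Norwegian alphabet order.
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"  # æ, ø, å come last


def norwegian_key(name):
    """Map each character to its position in the Norwegian alphabet.

    Characters outside the alphabet (such as spaces) map to -1 and so
    sort before every letter. Real collation handles far more cases.
    """
    return [NORWEGIAN_ALPHABET.find(ch) for ch in name.lower()]


names = ["Øvre Eiker", "Ål", "Asker"]

# Default string comparison in Python is by code point.
code_point_order = sorted(names)
# Order a Norwegian reader would expect.
norwegian_order = sorted(names, key=norwegian_key)

print(code_point_order)  # ['Asker', 'Ål', 'Øvre Eiker']
print(norwegian_order)   # ['Asker', 'Øvre Eiker', 'Ål']
```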

How can I get correct Norwegian sorting?

Edit: I tried changing the "community_name" field to use "icu_collation_keyword" instead of "keyword":

community_name => {
    type   => 'text',
    fields => {
        raw => {
            type     => 'icu_collation_keyword',
            index    => 'false',
            language => 'nb',
        },
    },
},

But this results in garbled output:

Akershus - 276855 - 229202 - 80131
    ᦥ免⡠႐໠  - 314430 - 243684 - 87105
    ↘卑◥猔᠈〇㠖 - 202339 - 225665 - 78186
    ⚞乀⃠᷀  - 306985 - 237405 - 83186
    ⦘卓敫တ倎瀤 - 218060 - 218407 - 75602
    ⸳䄓†怜〨 - 271174 - 216843 - 75257

1 answer:

Answer 0: (score: 0)

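If the field you are aggregating on (community_name in your example) always has exactly one value, I think you can try the following, which extends what you have so far.

Basically, you add another sub-aggregation on the original, non-garbled value, and then read that value on the client side for display.

I'll show it on a simplified mapping: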

PUT /icu_index
{
    "mappings": {
        "my_type": {
            "properties": {
                "community": {
                    "type": "text",
                    "fields": {
                        "raw": {
                            "type": "keyword"
                        },
                        "norwegian": {
                            "type": "icu_collation_keyword",
                            "index": false,
                            "language": "nb"
                        }
                    }
                },
                "wealth": {
                    "type": "long"
                }
            }
        }
    }
}

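We store the community name in three forms:

  1. unchanged, as community;
  2. as a keyword, in community.raw;
  3. as an icu_collation_keyword, in community.norwegian.

Then we index a couple of documents (note: community holds a single string, not a list of strings):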

    PUT /icu_index/my_type/2
    {
        "community": "Ål",
        "wealth": 10000
    }
    
    PUT /icu_index/my_type/3
    {
        "community": "Øvre Eiker",
        "wealth": 5000
    }
    

Now we can run the aggregation:

    POST /icu_index/my_type/_search
    {
       "size": 0,
       "aggs": {
          "communities": {
             "terms": {
                "field": "community.norwegian",
                "order": { 
                    "_term": "asc"
                }
             },
             "aggs": {
                "avg_wealth": {
                   "avg": {
                      "field": "wealth"
                   }
                },
                "community_original": {
                    "terms": {
                        "field": "community.raw"
                    }
                }
             }
          }
       }
    }
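
As a rough sketch of how the same pattern might extend to the nested county/community aggregation from the question, the request body can be assembled in Python. This assumes both county and community_name have a `norwegian` sub-field of type icu_collation_keyword next to `raw`; those sub-field names, the `sorted_terms_agg` helper and the `original` sub-aggregation name are hypothetical and not part of the mapping in the question:

```python
import json


def sorted_terms_agg(sort_field, display_field, sub_aggs, size=10):
    """Build a terms aggregation ordered by sort_field.

    display_field is added as an "original" sub-aggregation so the
    readable name can be recovered from each bucket on the client.
    """
    aggs = dict(sub_aggs)
    aggs["original"] = {"terms": {"field": display_field, "size": 1}}
    return {
        "terms": {"field": sort_field, "size": size, "order": {"_term": "asc"}},
        "aggs": aggs,
    }


# The three averages are reused at county level and at community level.
averages = {"avg_" + f: {"avg": {"field": f}} for f in ("wealth", "income", "tax")}

# Assumed sub-field names: "<field>.norwegian" for sorting, "<field>.raw" for display.
communities = sorted_terms_agg(
    "community_name.norwegian", "community_name.raw", averages, size=1000
)
counties = sorted_terms_agg(
    "county.norwegian", "county.raw",
    dict(averages, communities=communities), size=100
)

body = {"size": 0, "aggs": {"counties": counties}}
print(json.dumps(body, indent=2))
```

The resulting `body` would be sent as the search request in place of the hand-written JSON; the sub-aggregation is limited to one term because each bucket is expected to hold a single original name.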
    

We still sort by community.norwegian, but we also add a sub-aggregation on community.raw. Let's look at the result:

       "aggregations": {
          "communities": {
             "doc_count_error_upper_bound": 0,
             "sum_other_doc_count": 0,
             "buckets": [
                {
                   "key": "⸳䃔楦၃৉瓅ᘂก捡㜂\u0000\u0001",
                   "doc_count": 1,
                   "community_original": {
                      "doc_count_error_upper_bound": 0,
                      "sum_other_doc_count": 0,
                      "buckets": [
                         {
                            "key": "Øvre Eiker",
                            "doc_count": 1
                         }
                      ]
                   },
                   "avg_wealth": {
                      "value": 5000
                   }
                },
                {
                   "key": "⸳䄏怠怜〨\u0000\u0000",
                   "doc_count": 1,
                   "community_original": {
                      "doc_count_error_upper_bound": 0,
                      "sum_other_doc_count": 0,
                      "buckets": [
                         {
                            "key": "Ål",
                            "doc_count": 1
                         }
                      ]
                   },
                   "avg_wealth": {
                      "value": 10000
                   }
                }
             ]
          }
       }
    

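Now the buckets are sorted by the ICU collation of the community name. The first bucket, whose key is "⸳䃔楦၃৉瓅ᘂก捡㜂\u0000\u0001", has its original value in community_original.buckets[0].key, which is "Øvre Eiker".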

Note: if community_name can hold a list of values, this hack will of course not work.
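
On the client side, reading the display name back out could look roughly like the following Python sketch. The `flatten_buckets` helper is hypothetical, the response is a trimmed copy of the one above with the collation keys replaced by placeholders, and the single-value check is just one way of noticing the multi-valued case:

```python
def flatten_buckets(response, agg_name="communities", original_agg="community_original"):
    """Return (display_name, avg_wealth) pairs in the order the buckets arrive.

    Each bucket key is an opaque collation key, so the readable name is
    taken from the sub-aggregation on the raw field instead.
    """
    rows = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        originals = bucket[original_agg]["buckets"]
        if len(originals) != 1:
            # More than one raw value per collation key suggests a multi-valued field.
            raise ValueError(
                "expected exactly one original value, got %d" % len(originals)
            )
        rows.append((originals[0]["key"], bucket["avg_wealth"]["value"]))
    return rows


# Trimmed copy of the response above; keys shortened to placeholders.
response = {
    "aggregations": {
        "communities": {
            "buckets": [
                {
                    "key": "<collation key 1>",
                    "doc_count": 1,
                    "community_original": {
                        "buckets": [{"key": "Øvre Eiker", "doc_count": 1}]
                    },
                    "avg_wealth": {"value": 5000},
                },
                {
                    "key": "<collation key 2>",
                    "doc_count": 1,
                    "community_original": {
                        "buckets": [{"key": "Ål", "doc_count": 1}]
                    },
                    "avg_wealth": {"value": 10000},
                },
            ]
        }
    }
}

for name, avg_wealth in flatten_buckets(response):
    print(name, "-", avg_wealth)
```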

Hope this hack helps!