How to ignore trailing whitespace in an Elasticsearch aggregation query

Date: 2015-10-01 06:37:24

Tags: elasticsearch

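I have an aggregation query that buckets by city name, with a sub-aggregation on country name. The query (which I am running in Sense) is as follows: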

GET test/_search
{
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "name.autocomplete" : {
            "query" : "new yo",
            "type" : "boolean"
          }
        }
      },
      "must_not" : {
        "term" : {
          "source" : "old"
        }
      }
    }
  },
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "field" : "cityname.raw",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "field" : "countryname.raw"
          }
        }
      }
    }
  }
}

Now the document New York appears twice, once with an extra trailing space. The aggregation result I get is:

{
  "key": "New York",
  "doc_count": 1,
  "city_name": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "United States of America",
        "doc_count": 1
      }
    ]
  }
},
{
  "key": "New York ",
  "doc_count": 1,
  "city_name": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "United States of America",
        "doc_count": 1
      }
    ]
  }
}

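I need "New York" and "New York " to be treated as the same value. Is there any way to write the query so that they end up in the same bucket? I am guessing something that trims trailing spaces would do it, but I could not find anything. Thanks.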

1 answer:

Answer 0 (score: 2)

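Ideally, you would clean up the fields before indexing your documents. If that is not an option, you can still clean them up after the fact, for example with the update-by-query plugin.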
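
As a rough illustration of cleaning fields before indexing, here is a minimal Python sketch. The helper name `clean_document` and the sample documents are assumptions made for this example; only the field names `cityname` and `countryname` come from the question. It strips surrounding whitespace from the chosen string fields and leaves the actual indexing call to whatever client you use.

```python
def clean_document(doc, fields=("cityname", "countryname")):
    """Return a copy of doc with surrounding whitespace removed from the given string fields.

    Hypothetical helper: the field list mirrors the question's mapping.
    """
    cleaned = dict(doc)  # shallow copy so the caller's document is not modified
    for field in fields:
        value = cleaned.get(field)
        if isinstance(value, str):
            cleaned[field] = value.strip()
    return cleaned


# Sample documents (made up for illustration), one with a trailing space.
docs = [
    {"cityname": "New York", "countryname": "United States of America"},
    {"cityname": "New York ", "countryname": "United States of America"},
]

# Clean every document before handing it to the indexing client.
cleaned_docs = [clean_document(d) for d in docs]
```

After this step both documents carry the same `cityname`, so a plain terms aggregation on the raw field would put them in one bucket.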

Alternatively, though with worse performance, use a terms aggregation with a script instead of a field, like this:

...
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "script" : "doc['cityname.raw'].value.trim()",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "script" : "doc['countryname.raw'].value.trim()"
          }
        }
      }
    }
  }
}
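
If you build the request from application code, the same script-based aggregation can be assembled programmatically. The sketch below is only one way to do it: the helper name `trimmed_terms` is hypothetical, and the script string is copied from the snippet above rather than taken from any client library.

```python
import json


def trimmed_terms(field, **options):
    """Build a terms aggregation that buckets on the trimmed value of `field`.

    Hypothetical helper; extra keyword arguments (e.g. min_doc_count) are
    passed through as terms-aggregation options.
    """
    terms = {"script": "doc['%s'].value.trim()" % field}
    terms.update(options)
    return {"terms": terms}


# Outer aggregation on city, nested aggregation on country, both trimmed.
city_agg = trimmed_terms("cityname.raw", min_doc_count=1)
city_agg["aggregations"] = {"country_name": trimmed_terms("countryname.raw")}
aggregations = {"city_name": city_agg}

# Serialize the aggregation part of the request body.
print(json.dumps({"aggregations": aggregations}, indent=2))
```

The resulting dictionary can be merged with the existing `query` clause and sent as the search body.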

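Another solution is to change the field from a not_analyzed string to an analyzed one, but with a custom analyzer that uses the keyword tokenizer to keep the whole value as a single token (as not_analyzed does) and adds the trim token filter: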

{
  "settings": {
    "analysis": {
      "analyzer": {
        "trimmer": {
          "type": "custom",
          "filter": [ "trim" ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "cityname": {
          "type": "string",
          "analyzer": "trimmer"
        },
        "countryname": {
          "type": "string",
          "analyzer": "trimmer"
        }
      }
    }
  }
}

If you index cityname: "New York City ", the token that gets stored in the index will be trimmed to "New York City".
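
To see why this merges the buckets, here is a small Python emulation of the idea. It is not Elasticsearch code: `trimmer_analyze` is a hypothetical stand-in for the custom analyzer (one token for the whole input, then a trim), and `Counter` stands in for the terms aggregation.

```python
from collections import Counter


def trimmer_analyze(text):
    """Emulate the custom analyzer: the keyword tokenizer emits the whole
    input as one token, and the trim filter strips surrounding whitespace."""
    return [text.strip()]


# The two city values from the question, one with a trailing space.
indexed_values = ["New York", "New York "]

# Bucketing on the raw values keeps them apart.
raw_buckets = Counter(indexed_values)

# Bucketing on the analyzed tokens merges them.
trimmed_buckets = Counter(
    token for value in indexed_values for token in trimmer_analyze(value)
)
```

Here `raw_buckets` holds two keys with a count of 1 each, while `trimmed_buckets` holds the single key "New York" with a count of 2. Python's `str.strip()` only approximates the filter's whitespace handling, so treat this as an illustration of the effect rather than an exact model.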