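I'm trying to run a case-insensitive aggregation on a keyword-type field, but I'm having trouble getting it to work.
What I've tried so far is adding a custom analyzer named "lowercase", which uses the "keyword" tokenizer and the "lowercase" filter. I then added a sub-field named "use_lowercase" to the mapping for the field I want to work with. I also want to keep the existing "text" and "keyword" field components, since I may want to search for terms within the field.
Here is the index definition, including the custom analyzer: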
PUT authors
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "famousbooks": {
      "properties": {
        "Author": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            },
            "use_lowercase": {
              "type": "text",
              "analyzer": "lowercase"
            }
          }
        }
      }
    }
  }
}
Now I add 2 records with the same author, but with different casing:
POST authors/famousbooks/1
{
  "Book": "The Mysterious Affair at Styles",
  "Year": 1920,
  "Price": 5.92,
  "Genre": "Crime Novel",
  "Author": "Agatha Christie"
}
POST authors/famousbooks/2
{
  "Book": "And Then There Were None",
  "Year": 1939,
  "Price": 6.99,
  "Genre": "Mystery Novel",
  "Author": "Agatha christie"
}
So far so good. Now, if I run a terms aggregation on the author,
GET authors/famousbooks/_search
{
  "size": 0,
  "aggs": {
    "authors-aggs": {
      "terms": {
        "field": "Author.use_lowercase"
      }
    }
  }
}
I get the following result:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "authors",
        "node": "yxcoq_eKRL2r6JGDkshjxg",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
    }
  },
  "status": 400
}
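So it seems to me that the aggregation treats the field as text rather than keyword, and therefore gives me the fielddata warning. I thought ES would be sophisticated enough to recognize that the terms field is effectively a keyword (because of the custom analyzer) and is therefore aggregatable, but that isn't the case. If I add "fielddata": true to the mapping for Author, the aggregation works fine, but I'm hesitant to do so given the dire warning about high heap usage when this value is set.
Is there a best practice for this kind of case-insensitive keyword aggregation? I was hoping I could simply say "type": "keyword", "filter": "lowercase" in the mappings section, but that doesn't seem to be available.
If I go the "fielddata": true route, it feels like I'm using too big a stick to get this to work. Any help with this would be much appreciated!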
Answer 0: (score: 2)
It turns out the solution is to use a custom normalizer rather than a custom analyzer.
PUT authors
{
  "settings": {
    "analysis": {
      "normalizer": {
        "myLowercase": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "famousbooks": {
      "properties": {
        "Author": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            },
            "use_lowercase": {
              "type": "keyword",
              "normalizer": "myLowercase",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
This then allows the field Author.use_lowercase to be used in a terms aggregation without any problems.
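To illustrate why this works, here is a minimal plain-Python sketch of the effect a lowercase normalizer has on term buckets. It does not call Elasticsearch at all; `terms_buckets` is a hypothetical helper that only mimics counting values per term, with an optional normalization step applied first.

```python
from collections import Counter

# The two "Author" values indexed in the question.
authors = ["Agatha Christie", "Agatha christie"]

def terms_buckets(values, normalize=None):
    """Count values per term, optionally normalizing each value first.

    Hypothetical stand-in for a terms aggregation; illustrative only.
    """
    keys = [normalize(v) if normalize else v for v in values]
    return Counter(keys)

# Without normalization the two spellings land in separate buckets.
raw_buckets = terms_buckets(authors)

# With a lowercase step (what the normalizer does at index time),
# both values collapse into a single bucket.
lowercase_buckets = terms_buckets(authors, normalize=str.lower)
```

Note that the bucket key is the normalized (lowercased) form, so the original casing is not what the aggregation reports.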
Answer 1: (score: 0)
You did define use_lowercase as text:
"use_lowercase": {
  "type": "text",
  "analyzer": "lowercase"
}
Try defining it as type: keyword instead; that helped me solve a similar problem with sorting.
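Answer 2: (score: 0)
By default this doesn't seem to be possible (without the "lowercase" normalizer), but if that isn't available you can use a trick: convert the string into a case-insensitive regular expression.
For example, for the string "bar" the case-insensitive regex is "[bB][aA][rR]".
I use a Python helper to do this: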
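import itertools

def case_insensitive_regex_from_string(v):
    # Nothing to convert for an empty string or None.
    if not v:
        return v
    # Wrap each character and its case-swapped form in a character class: "b" -> "[bB]".
    zip_obj = zip(itertools.cycle('['), v, v.swapcase(), itertools.cycle(']'))
    return ''.join(''.join(x) for x in zip_obj)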
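One caveat with this trick is that characters such as "." or "]" in the input are also placed inside brackets, which can produce patterns that don't mean what was intended. Below is a minimal sketch of a variant that escapes non-letter characters. The name `case_insensitive_regex` is hypothetical, and the escaping follows Python's `re` module; Elasticsearch's regexp syntax is not identical, so treat this as an illustration rather than a drop-in query builder.

```python
import re

def case_insensitive_regex(text):
    """Build a regex that matches `text` regardless of letter case.

    Hypothetical helper: cased letters become a two-character class,
    everything else is escaped so it is matched literally.
    """
    parts = []
    for ch in text:
        lower, upper = ch.lower(), ch.upper()
        # Only use a character class when the character has distinct,
        # single-character lower and upper forms.
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            parts.append("[" + lower + upper + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

pattern = case_insensitive_regex("bar")
author_pattern = case_insensitive_regex("Agatha Christie")
```

For example, `re.fullmatch(author_pattern, "agatha christie")` finds a match, while a pattern built from "a.b" does not match "aXb" because the dot is escaped.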